How does Neo4j compare to HyperGraphDB

Graph database systems overview and benchmark

Transcript

1 Graph Database Systems Overview and Benchmark Thesis to obtain the academic degree Diploma Computeriker Humboldt-Universitat zu Berlin Faculty of Mathematics and Natural Sciences II Institute for Computer Science submitted by: Benjamin Raphael Gehrels born on: September 10, 1987 in: Heidelberg Reviewer: Prof. Dr. Ulf Reader Prof. Dr.-Ing. Stefan Edlich submitted on: defended on: ......

2

3 Summary Network-like structures are currently experiencing increasing interest in computer science, both in the academic and in the commercial area. The trend to use the social interaction of users and the similarity of their interests to improve online services is probably the most influential factor in the commercial sector. Social networks are one of the most prominent examples here. In the academic sector, for example, the analysis of metabolic networks and Call protein interaction networks. In the last few years a large number of new database management systems came onto the market under the heading NoSQL. They try to better solve certain problems with other than the dominant relational data models and sometimes new query paradigms. In this context, graph database management systems are also enjoying increasing popularity. These use graphs, a mathematical representation of networks, as a data model and promise advantages for the storage and analysis of large networks. This thesis gives an overview of the topic of graph database management systems. It begins with a historical classification and defines the mathematical basis of the data models used. Based on four exemplary systems Neo4j, FlockDB, HyperGraphDB and DEX, a functional comparison of different systems available on the market is then carried out: which data models were selected, which query languages ​​exist, which index structures can be used, how are the graph data persisted, which approaches are used for transaction control supported and how do the systems scale with growing volumes of requests or data? The purely functional analysis is compared with a benchmark of the examined systems, which measures and compares the speed of typical graph queries on graphs of different sizes. This work is concluded with a summarizing classification of the findings: Which concepts and systems prove to be advantageous for which areas of application and where are their problems?

4 Contents 1. Motivation Problems in using relational databases for storing graphs Graph database management systems as an alternative Related Work Objective of the work Historical background Basic terms Graphs Hypergraphs Regular path queries Regular graph queries Graph database management systems Embedded database management system Horizontal and vertical scalability Overview of various graph database management systems Neo4j index structures Inquiry mechanisms Transactions Persistence and caching Scalability and resilience FlockDB query mechanisms Persistence and caching transactions Scalability and resilience HyperGraphDB index structures Query mechanisms Persistence and caching

5 Transactions Scalability and Resilience DEX Query Mechanisms Transactions Persistence and Caching Index Structures Scalability and Resilience Summary Benchmark Data Base Typical Properties of Large Graphs Generation of Graphs The Generated Data Algorithms Neo4j FlockDB HyperGraphDB DEX Experiment Structure Neo4j FlockDB HyperGraphGDB DEX Summary Neo4j FlockDB HyperGraphGraph DEX Method Criticism Results Results of the Individual Algorithms HyperDBX HyperDBX Overall Criteria Results of the Individual Algorithms Outlook A. Bibliography 89 B. Measurement results of the benchmark 96 C. Declaration of independence 120 5

6 List of Figures 5.1. Example of a directed multigraph with edge labels and node properties Example of a labeled and directed multi-hypergraph Schematic representation of a coherent regular graph query with 4 variables and 4 expressions Exemplary representation of a Neo4j database Exemplary representation of a FlockDB database Exemplary representation of a HyperGraphDB database Schematic representation the data storage of HyperGraphDB Exemplary representation of a DEX database The node degree distribution of the generated benchmark graph (nodes, 1 million edges) compared to a power law distribution with γ = 1, visualization of the regular graph expression carried out as part of the benchmark: given the nodes x, which nodes y and z are connected to this via a triangular path with the labels L1, L2 and L3? Measurement results: Import of a graph with a variable number of nodes and edges into different database systems Measurement results: Reading out all edges of a graph with a variable number of nodes and edges with cold caches from different database systems Measurement results: Reading out transitively incidental nodes in a graph with a variable number of nodes and edges with warm ones Caches from different database systems Measurement results: Calculation of the strongly related components of a graph with a variable number of nodes and edges in warm caches from different database systems Measurement results: Reading out common incidental nodes in a graph with a variable number of nodes and edges in warm caches from different database systems Measurement results: Finding triangles in a graph with a variable number of nodes and edges with warm caches from different database systems

7 List of tables 6.1. Overview of the examined database management systems: general information, data model and query options Overview of the examined database management systems: persistence, caching and index structures Overview of the examined database management systems: transactions and distribution B.1. Measurement results: data import in Neo4j B.2. Measurement results: Reading out all edges from Neo4j B.3. Measurement results: Calculation of the strongly connected components with Neo4j B.4. Measurement results: Query transitive incidental nodes on Neo4j B.5. Measurement results: query of common incidental nodes on Neo4j. 100 B.6. Measurement results: Finding triangular paths in Neo4j B.7. Measurement results: data import into FlockDB B.8. Measurement results: Reading out all edges from FlockDB B.9. Measurement results: Calculation of the strongly connected components with FlockDB B.10. Measurement results: Query transitive incidental nodes on FlockDB B.11. Measurement results: Query common incidental nodes on FlockDB 106 B.12. Measurement results: Searching for triangular paths in FlockDB B.13. Measurement results: Data import into HyperGraphDB B.14.Measuring results: Reading out all edges from HyperGraphDB B.15.Measuring results: Calculation of the strongly connected components with HyperGraphDB B.16.Measuring results: Query transitive incidental nodes on HyperGraphDB111 B.17.Measuring results: Querying common incident nodes on Hyper - GraphDB B.18.Measurement results: Search for triangular paths in HyperGraphDB B.19.Measurement results: Data import into DEX B.20.Measurement results: Reading out all edges from DEX B.21.Measurement results: Calculation of the strongly connected components with DEX B.22.Measurement results : Query transitive incident nodes on DEX B.23.Measurement results: Query common incidental nodes on DEX. 118 B.24.Measurement results: Finding triangular paths in DEX

8 listings 1.1. Example of a recursive SQL query: Find all superiors of a person Example of a graph-based query language (Cypher): Find all superiors of a person Example of a read-only Cypher query Example of a complex selection query in FlockDB [Gehrels, 2013] SQL schema of the FlockDB edge tables SQL - Scheme of the FlockDB node tables

9 1. Motivation Many structures in the real world can be interpreted as graphs, a mathematical representation of networks. This is evident in road networks, relationships between people or the reaction pathways of biochemical components, as is the case with tree structures, for example hierarchies, ontologies or parts lists in industry, in which individual products are each made up of preliminary products. But graphs are also built in object-oriented programming: object instances can be interpreted as nodes and their relationships as edges. If you want to process graphs automatically, they usually have to be saved and read out efficiently. Database systems can be used for this. Problems with the use of relational databases for storing graphs A class of database management systems that is widespread in practice is based on the relational data model developed by Codd [1970]. This is based on the concept of relation, a set of tuples of the same length and the same domain sequence. Relationships between different tuples are only represented by the fact that one of the tuples contains the identifying key of the other tuple as an attribute (foreign key). With joins, the tuples of several relations can be linked to form a new relation; Joins are very computationally expensive, although the effort can be reduced by using index structures. The relational model thus focuses on the entities themselves (in the form of tuples) and their attributes (in the form of elements of the tuple). These result explicitly from the data model. Relationships between entities, on the other hand, are only represented implicitly via the equality of attribute values. When navigating in large road networks, for example, the focus is on the routes between the locations and less on the locations themselves. In the case of biochemical reaction paths, the reaction chain is the interesting part, not the individual substances involved in isolation. In such models, the important information lies in the structure of the modeled graph, not in the properties of its 9

10 WITH RECURSIVE manager (level, managerid) AS (SELECT 1 AS depth, employees.managerid AS managerid FROM employees WHERE employees.employeeid = 13 UNION ALL SELECT manager.depth + 1, employees.managerid FROM manager INNER JOIN employees ON employees.employeeid = manager.managerid) SELECT manager.depth, manager.employeeid FROM manager ORDER BY manager.depth ASC; Listing 1.1: Example of a recursive SQL query: Find all superiors of a person node. It therefore makes sense to use a data model for storing such data that focuses on these relationships. According to Iordanov [2010] there are three typical query patterns on graphs. Only one of them, set-based queries, is well covered by the widely used relational query language SQL [1999]. The other two query patterns show the discrepancy between the focus of the relational data model and the questions of graph-oriented problems. Thus, traversing queries, starting from one or more nodes, follow the edges of the graph; an example of this is the search for the shortest paths in a graph. Graph pattern matching searches the graph for certain patterns that are formed by edges and their incident nodes, for example the search for triangular relationships and the people involved in a social network. Traversing queries are possible in SQL [1999], but the consequences of an unsuitable data model become apparent here: Using the non-optional components of SQL [1999], such queries are only possible if the exact traversing depth is known. Even then, they have to be calculated using computationally expensive JOIN cascades. SQL [1999] does specify a possible solution in the form of an optional feature

11 START employee = node (13) MATCH manager [path: manages]> employee RETURN manager.id, length (path) AS depth ORDER BY depth ASC Listing 1.2: Example of a graph-based query language (Cypher): Find all superiors of a person res T131 (Recursive Common Table Expressions), but its semantics are also set-oriented. Listing 1.1 shows a simple example of such a recursive SQL query: It selects all direct and indirect superiors for a given person (the person with ID 13) along with the respective hierarchy level. Even with such an intuitively simple request, a quite complex request construct is necessary. It is particularly noticeable that the declarativity that characterizes SQL is largely lost here. Instead of formulating the problem in SQL as with quantity-oriented queries and letting the database management system find the solution, the algorithm is already outlined here. In addition, the original intention of this SQL query is difficult to recognize. Graph database management systems as an alternative Graph database management systems can represent an alternative here: They place the relationship between entities in the focus of the data model. The links do not result implicitly from attribute values, but are an explicit part of the data. Depending on the system, edges can be directed or undirected, nodes and edges can be labeled and nodes or edges can be annotated with attributes. Graph database management systems also offer partially graph-based query languages ​​and have the option of optimizing data storage for typical graph query patterns. Graph-based query languages ​​are for example SPARQL [Prud hommeaux and Seaborne, 2008], Gremlin [Rodriguez, 2013] or Cypher [Neo Technology, 2012, 14]. Listing 1.2 shows a Cypher query that calculates the same result as the SQL query from Listing 1.1. For this purpose, the hierarchy is modeled as a directed tree, with an edge labeled MANAGES from each manager leading to his immediate employees. Cypher will be discussed in more detail in section. Since the query language follows the same concept (graph) as the problem to be solved, the problem can also be expressed directly in Cypher code. This makes queries easier to read, easier to write and potentially more flexible to optimize. Graph database systems also offer the opportunity to save data for their 11th birthday

To optimize the purpose of use: Since the relationships between the objects are explicitly given, these could also be persisted as a direct link, for example. Linked entities could also be persisted in groups so that they can be loaded more efficiently. 12th

13 2. Related Work Rodriguez and Neubauer [2010a] offer a good overview of various possibilities for semantically enriching graphs in order to use them as a data model in a database. Attributes, labels, weightings, etc. They also give a brief introduction to the field of graph database systems. The work by Angles and Gutierrez [2008] offers a now historical, but very broad and in-depth analysis of the literature on graph data models and graph query languages ​​back to the year. They show not only the models themselves, but also the lines of development and historical relationships. Two approaches to graph queries are often used in current graph database systems: On the one hand, traversing queries. Rodriguez and Neubauer [2010b] have published a good overview work on this, which also offers a mathematical formalization of this concept, which is often only roughly circumscribed in the literature. On the other hand, regular path queries, which were introduced by Mendelzon and Wood [1989] and are considered in more detail in Section 5.3. The work by Buerli [2012] gives an overview of areas of application of graph database systems. In addition, with a categorization of 18 graph database management systems as pure, distributed, key-value-based, document-based, SQL-based and map-reduce-based database management systems, each with a summarizing paragraph, it offers an excellent introduction to the field. Goes a little more in depth [Angles, 2012], but because of the brevity of this work, it only offers a rough overview of 9 different graph database management systems: The work categorizes these based on their data models, their ability to define integrity constraints and the suitability of their query mechanisms for certain graph requests. Various authors have dealt with benchmarks of graph database systems both theoretically and practically. Leskovec et al. [2008] gave a good overview of the essential properties of existing large graphs, Chakrabarti and Faloutsos [2006] give an overview of a large number of algorithms for generating synthetic ones with such properties. They also give an overview of the historical development of this field of research. Dominguez-Sal et al. [2011] published an excellent discussion on the design of graph database system benchmarks on this basis. They span the spectrum from the analysis of motivating use cases of graph database systems to the characteristics of meaningful test data and selection criteria 13

14 for benchmark algorithms through to recommendations for the test setup and measurement methodologies. Benchmark suites for object-oriented database systems and thus for object graphs have, inter alia, Cattell and Skeen [1992] and Carey et al. [1993], Guo et al. [2005] define a benchmark suite for RDF graphs. Another benchmark suite, which is originally geared towards graph processing, was developed by Bader et al. [2009] defined. Vicknair et al. [2010] benchmarked Neo4j and MySQL against each other. They come to the conclusion that Neo4j is the faster choice for traversing queries and, by using Lucene as an index solution, also for full-text queries, while MySQL is the faster choice for queries that count entities based on certain attributes. Holzschuher and Peinl [2013] benchmarked various query options from Neo4j, both against each other and against MySQL / JPA.They also see Neo4j as having an advantage over MySQL for traversing queries, but at a disadvantage for queries that only read out individual nodes in full. In addition, there are a few benchmarks that were carried out by developers of graph database systems and, unsurprisingly, each turn out their own system to be advantageous: Ciglan et al. [2012], who compare DEX, Neo4j, SAIL, OrientDB and their prototypes SGDB on the basis of traversing queries, Dominguez-Sal et al. [2010], who compare their database management system DEX with Neo4j, Jena and HyperGraphDB, Martínez-Bazan et al. [2012], who compare DEX with MonetDb, MySQL and Neo4j from the same institute. The fact that all of this work was published by the respective developers does not mean that this work necessarily has a methodological bias. However, it shows that there are hardly any manufacturer-independent benchmarks that compare graph database systems with one another. 14th

15 3. Aim of the thesis The existing comparative literature on graph database systems is mainly limited to an abstract level and compares data models and the existence of different types of query languages. A more in-depth analysis of the various systems, in particular with regard to the data structures used and their optimization for certain query mechanisms, is largely lacking. At the same time, there are hardly any manufacturer-independent benchmarks that examine the effects of these design decisions on the speed of the systems. This work tries to fill this gap. For this purpose, various graph database management systems selected as examples are presented first. This includes introductory the graph models used and the query options. This is followed by a look at the technical implementation: Which approaches were chosen to persist graphs? Which caching methods were chosen? How do the systems deal with concurrent access? What options do the systems offer to react to increasing amounts of data, increasing write loads or increasing read loads? This work is supplemented by a speed measurement of the database systems using different query patterns on graphs of different sizes. This should make visible the effects of the design decisions on which the individual systems are based on the performance of the overall system. Using the example of four exemplarily selected systems, it is worked out in which application areas which of the selected approaches offer advantages and where their disadvantages lie - a broader investigation of all graph database management systems available on the market would go well beyond the scope of a diploma thesis. Neo4j [Neo Technology, 2012] stands out from the existing systems because of its powerful and diverse query options. It links data records on the persistence level and is therefore also an example of explicitly graph-based data persistence. DEX [Martínez-Bazan et al., 2007] represents the opposite extreme: It has only very minimalist query options. DEX is based on B-trees and bitmaps for data persistence and is therefore exemplary for graph-based database management systems that do not persist any direct link between entities. 15th

16 HyperGraphDB 1.2 [Iordanov, 2010] is based on the data model of a hypergraph and thus provides the most expressive data model among the currently available graph database management systems. FlockDB 1 [Pointer et al., 2010] takes a completely different approach: It merely represents a graph-based mediator between the application and a relational database management system, which is addressed via SQL [1999]. It only supports quantity-based queries. In addition, FlockDB is the only system under consideration that is explicitly designed for high, concurrent write loads. These four systems are particularly suitable as a random sample to form an overview of the existing concepts, some of which differ greatly. In addition, they are available free of charge, at least for academic use. Further graph database management systems, which were not considered in this thesis, are OrientDB, InfoGrid, Objectivity InfiniteGraph and AllegroGraph. In addition, there are VertexDB, which apparently is no longer actively developed 2 and GraphDB, which is still for sale today as part of the bankruptcy estate of its developer, Sones GmbH 3. At least part of GraphDB is online under AGPL 4. Also in this Systems for the distributed analysis of very large graphs such as Google Pregel [Malewicz et al., 2010] and its open source implementation Apache Giraph Ching et al. [2012] or GraphLab [Gonzalez et al., 2012]. These do not represent graph database management systems in the sense of this thesis, since they were not designed for the continuous management of data, but for the occasional analysis of large amounts of data. 1 The version of the Git-Master-Branch on Github.com from April 28th, 2012 is being considered: 2 The GIT-Repository of the project only records one commit in the last 2 years: accessed on March 13th Telephone information from the insolvency administrator, Dr . Oliver Hartig, March 12th. Located at: accessed March 13th

17 4. Historical Background In the last few years a large number of new graph database systems have come onto the market. However, the idea of ​​using network structures as a data model for database management systems goes back a long way. Charles Bachman designed the Integrated Data Store, one of the first general-purpose database systems, in the mid-1960s. It was based on the data model of a directed, cycle-free graph of records and pointers, which was later (1969) specified as a network model (also CODASYL data model) by the Conference on Data Systems Languages, IBM published the Information Management System, which is based on the data model of a directed tree based, called hierarchical model [Singh, 2009, 1.9]. The relational model proposed by Codd [1970] subsequently increasingly replaced the network and hierarchical model after the introduction of SQL / DS by IBM in the early 1980s. Singh [2009, 1.9] attributes this to the simplicity of the relational model, the ease of use of the relational database systems even by non-programmers using SQL as the query language and the declarative abstraction of implementation details of the database management system inherent in SQL. The idea of ​​using graph databases as general purpose databases arose in the early 1990s. Amann and Scholl [1992] describe a growing interest in graph models and languages ​​in current database research 1, especially for hypertext and geographic information systems. You design both a data model based on a directed labeled multigraph and a query language for it. In the late 1990s, object-oriented database management systems were increasingly developed [Singh, 2009, 1.9]. The Object Definition Language [Cattell et al., 2000] defined by the Object Data Management Group is based on a graph as an abstract data model, with objects as nodes and relations between them as edges. The inheritance hierarchies between classes are also graphs. With the emergence of the Semantic Web as a research area around the turn of the millennium, the Resource Description Framework [Klyne and Carroll, 2004] 1 Own translation. 17th

18 standardized as a data model. It describes facts as subject-predicate-object triples. With subject and object as nodes and predicates as edges, it forms a graph. Along with this, a new class of database systems was created, so-called triple stores. SPARQL [Prud hommeaux and Seaborne, 2008], probably the most prominent query language for semantic web data, is also designed as a graph query language. At the end of the last decade, a large number of new database systems came onto the market under the name NoSQL. What they have in common is that, apart from the relational model, they are looking for new ways of solving certain problems of data storage and query processing more efficiently. For this purpose, in contrast to the relational model, schema-less approaches were often chosen to store structured and semi-structured data. The interest in the analysis of very large graphs, be it in the area of ​​social, technical, semantic or biological networks, also led to a growing interest in graph database systems [Angles and Gutierrez, 2008, 2.4]. They also provide a good basis for the increasingly common recommendation systems, especially in the e-commerce sector [Rodriguez and Neubauer, 2010b, 3.1]. At about the same time, especially in the area of ​​these recommendation systems, the distillation of recommendations from sometimes enormous and growing amounts of transaction data became a challenge of great interest. It is often necessary to distribute these calculations over several systems. This trend is increasingly referred to as big data. In this context, Apache Hadoop [Apache, 2008], which is based on the MapReduce paradigm previously published by Dean and Ghemawat [2004], became popular. MapReduce's inadequacies in analyzing graph data subsequently led to the development of systems such as Google Pregel [Malewicz et al., 2010], Apache Giraph [Ching et al., 2012] and GraphLab [Gonzalez et al., 2012]. 18th