
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Execution Time Analysis of Electrical Network Tracing in Relational and Graph Databases

FELIX DE SILVA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: March 16, 2019
Supervisor: Mika Cohen
Examiner: Mads Dam
School of Electrical Engineering and Computer Science

Abstract

In today’s society, we handle a lot of connected data. Examples are companies like Facebook and Amazon, which handle connected data in different ways. Geographic Information Systems and Network Information Systems handle connected data in the form of networks or graphs that can represent anything from an electrical network to a product network.

When it comes to connected data, the most commonly used database technology is the relational database. However, with many new databases emerging, there may be better alternatives for connected data that can provide higher performance.

In this study we look at the Oracle relational database and the Neo4j graph database and study how both databases traverse an electrical network. The findings indicate that the Neo4j graph database outperforms the Oracle relational database regarding execution time of search queries.

Summary (Sammanfattning)

In today’s society we handle a lot of connected data. Examples are companies such as Facebook and Amazon, which handle connected data in different ways. Geographic information systems and network information systems handle connected data in the form of networks or graphs that can represent anything from an electrical network to a product network.

When it comes to connected data, the most widely used technology is the relational database. However, with many new databases emerging, there may now be better alternatives for connected data that can provide higher performance.

In this study we look at the Oracle relational database and the Neo4j graph database and examine how both databases traverse an electrical network. The presented results show that the Neo4j graph database performs graph traversal faster than the Oracle relational database, where the focus is on execution time.

Contents

1 Introduction
    1.1 Problem Background
    1.2 Research Question
    1.3 Objective
    1.4 Purpose
    1.5 Scope
    1.6 Terminology

2 Theoretical Background
    2.1 Relational Databases
        2.1.1 Tables and Keys
        2.1.2 Index
        2.1.3 Stored Procedure
        2.1.4 Query Processing
        2.1.5 The Oracle Relational Database
    2.2 Graph Databases
        2.2.1 Graphs
        2.2.2 Index-Free Adjacency
        2.2.3 Query Processing
        2.2.4 The Neo4j Graph Database
    2.3 Database Benchmarking
    2.4 Database Modeling
        2.4.1 Relational Modeling
        2.4.2 Graph Modeling
    2.5 Query Execution Time Estimation
        2.5.1 Access Time
        2.5.2 Storage Time
        2.5.3 Computation Time
        2.5.4 Communication Time
    2.6 Database Storage

3 Related Research
    3.1 Benchmarking Database Systems for Social Network Applications
    3.2 The Shortest Path Algorithm Performance Comparison in Graph and Relational Database on a Transportation Network
    3.3 Relational Database and Graph Database: A Comparative Analysis
    3.4 Comparative Analysis of Relational and Graph Databases

4 Methodology
    4.1 Datasets
    4.2 Modeling
    4.3 Benchmark Framework
        4.3.1 Query

5 Results
    5.1 Execution Time
    5.2 Throughput
    5.3 Standard Deviation

6 Discussion
    6.1 Benchmark Comparison
    6.2 Complexity Analysis
    6.3 Execution Plan Analysis
    6.4 Practical Applications

7 Conclusion
    7.1 Summary
    7.2 Future Work

Bibliography

A Neo4j Graph Creation Algorithm

B Cypher BFS Complete Search

C Cypher BFS Stop-Label Search

D Result Data - Complete Search

E Result Data - Stop-Label Search

Chapter 1

Introduction

In this chapter we present the research question along with the objective of this thesis. The purpose of the project is then clarified, and the limitations of the thesis are presented in the scope.

1.1 Problem Background

We live in a world where information exists everywhere and information storage is an essential part of society. Information such as e-mails and personal information needs to be stored in easily accessible and efficient ways. Today, storage of information can be either in physical or digital form; for computerized devices, digital formats like file systems and databases are the most efficient ways.

The development of traditional SQL databases has been ongoing since the 1980s [19] and has laid the foundation of modern databases. NoSQL and NewSQL are prominent database types that are today widely used. There are many aspects to consider in selecting a database type for storage in a system or application, such as the property of connectivity.

Everything that can be abstracted into a graph or network has the property of connectivity. The Internet with its hyperlink network, Facebook with its social network and Amazon with its product network are examples of large information networks built from data with high connectivity. When data are connected in a database in such a manner, the connections can be described as relationships between data points.


In relational databases, the data structures used for data storage are grid-structured tables. Connectivity among data here means that a data cell from a table refers to a data row in the same or another table. Accessing the data points in such relations can be done through the use of SQL JOIN operations, merging the tables to allow access to these data points. When datasets become more interrelated, carrying out queries can become more complex because of the possible need for more JOIN operations.

Native graph databases use nodes as first-class citizens for storing data and relationships as first-class citizens for connecting the nodes [10], creating a graph structure. Contained within each node is a list of relationship records that represent the node’s relationships to other nodes. When carrying out queries similar to JOIN operations, the database uses these lists and has direct access to the connected nodes, eliminating the need for the time-consuming computation required in relational databases.

There are many reasons why connectivity and JOIN operations have such a complicated relationship. One reason is the underlying architecture of each database type, that is, how the inner mechanics of handling and enabling connectivity work. According to Harrison [9], native graph databases are built with a primary focus on connectivity, which is not the case for relational databases. The choice of database type therefore varies depending on the application and on which aspects are prioritized.

Digpro is a company that deals with Geographic Information Technologies. They develop and provide software in the form of Geographic Information Systems (GIS) and Network Information Systems (NIS). dpSpatial is a platform developed by Digpro that lays the foundation of all of Digpro’s product applications. It utilizes an Oracle relational database for a range of functionalities, mostly for storing and retrieving data and executing search queries.

As mentioned, Digpro uses a relational database, in which connectivity is not the main focus, and this may affect the execution time of search queries. Using an alternative database that has connectivity as its main focus, such as a native graph database, may enhance the performance of dpSpatial.

The Neo4j graph database [20] is of specific interest for Digpro because it is widely used and optimized for connected data [19][10]. Digpro handles highly connected infrastructural electrical networks with different layers representing various levels of structures, cables and cable housing. Due to the complexity of the structured data with respect to connectivity, it is possible that the Neo4j graph database can be an effective solution for handling search queries. This possibility is further explored in this project.

1.2 Research Question

How does a native graph database affect the execution time of search queries in an interconnected electrical network, compared to a relational database?

1.3 Objective

This thesis aims to investigate whether a native graph database is a better alternative than a relational database, concerning the execution time of search queries requiring retrieval of connected data. In this study, the concept of execution time is interpreted as the time from when a query is sent until a response is received from the database.
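The definition of execution time given above can be made concrete with a small measurement helper. The following is a minimal Python sketch and not the benchmark code used in this thesis; `timed_query` and `fake_query` are hypothetical names introduced only for illustration, and `fake_query` stands in for a real database call.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_query(run_query: Callable[[], T]) -> Tuple[T, float]:
    """Return the query result and the elapsed wall-clock time in seconds.

    `run_query` is a hypothetical zero-argument callable that sends one
    query and blocks until the full response has been received.
    """
    start = time.perf_counter()            # the query is sent
    result = run_query()                   # blocks until the response arrives
    elapsed = time.perf_counter() - start  # the response has been received
    return result, elapsed


def fake_query() -> list:
    """Stand-in for a real database call (assumption for illustration)."""
    time.sleep(0.01)
    return ["row-1", "row-2"]


rows, seconds = timed_query(fake_query)
print(f"{len(rows)} rows in {seconds * 1000:.1f} ms")
```

Measuring on the client side in this way includes communication time as well as the work done inside the database, which matches the interpretation of execution time stated above.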

1.4 Purpose

This study explores the possibility of improving the execution time of search queries by replacing relational databases with native graph databases. If the study shows improvements, the results may be relevant for companies handling connected data and may aid them in further improving their software by exploring native graph databases.

1.5 Scope

The databases that are studied in this thesis are the Oracle relational database and the Neo4j graph database. Specifically, it is investigated whether Neo4j has a shorter execution time than Oracle when traversing an interconnected electrical network. The underlying models of the electrical networks are in this study identical for both databases.

In this study, we compare Neo4j and Oracle in terms of the database internals, the underlying architecture and the algorithms. External factors such as system latency are not within the scope of this study, because both databases operate on machines with similar hardware (e.g. SSD, CPU and RAM).

1.6 Terminology

Expression                              Abbreviation   Definition
--------------------------------------  -------------  ------------------------------------------
Oracle Relational Database              Oracle         A relational database developed by
                                                       Oracle Corporation.
Neo4j Graph Database                    Neo4j          A native graph database developed by
                                                       Neo4j Inc.
Relational Database Management System   RDBMS          An application that allows a user to
                                                       manage a relational database.
Structured Query Language               SQL            A programming language used to access
                                                       data in a relational database.
Database Benchmark                      DBB            An application that measures the
                                                       performance of a database.

Table 1.1: Short definitions of relevant expressions.

Chapter 2

Theoretical Background

This chapter gives an in-depth background on relational databases and graph databases, as well as on database benchmarking, database modeling, time estimation and database storage.

2.1 Relational Databases

The relational model was first introduced by Ted Codd of IBM in the 1970s [14][8]. This model has its theoretical foundation in set theory and first-order predicate logic. It gave rise to relational algebra, which is the fundamental logic behind relational databases.

A relational database is filled with collections of tables. Each collection is represented as a separate relational database unit. A Relational Database Management System (RDBMS) is an application that allows a user to manage access to the relational database [13].

Accessing the collection of data that resides within an RDBMS can be done in different ways. The standard way of accessing data is through a non-procedural, declarative, domain-specific programming language called Structured Query Language (SQL). The SQL functionality is divided into three main parts:

• SQL schema statements - The creation of data structures.

• SQL data statements - The manipulation of data.

• SQL transaction statements - The management of transactions.


All of these parts make SQL an efficient language for the relational database model [5].

JOIN operations belong to the part of the SQL language that handles data manipulation. These operations combine rows from different tables based on related columns to form a new table. There are several types of JOIN operations, but the only type of concern in this project is the inner JOIN. This JOIN operation can be used to combine two different tables; when it is used to combine a table with itself, it is called a self-JOIN.
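To illustrate the two JOIN forms described above, the following is a minimal Python sketch that uses the SQLite engine bundled with Python as a stand-in for a relational database. The tables `component` and `connection` and their contents are hypothetical and are not taken from the data model of this thesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE component (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE connection (from_id INTEGER, to_id INTEGER);
    INSERT INTO component VALUES (1, 'station'), (2, 'cable'), (3, 'cabinet');
    INSERT INTO connection VALUES (1, 2), (2, 3);
""")

# Inner JOIN: combine each connection with the name of its target component.
one_hop = conn.execute("""
    SELECT c.from_id, t.name
    FROM connection c
    INNER JOIN component t ON c.to_id = t.id
    ORDER BY c.from_id
""").fetchall()

# Self-JOIN: join the connection table with itself to follow two hops.
two_hops = conn.execute("""
    SELECT a.from_id, b.to_id
    FROM connection a
    INNER JOIN connection b ON a.to_id = b.from_id
""").fetchall()

print(one_hop)
print(two_hops)
```

Each additional hop in such a trace requires one more self-JOIN on the connection table, which is the growth in query complexity that the introduction refers to.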

The following subsections cover relational tables along with techniques to improve execution time (i.e. indexes and stored procedures) and the process behind query processing. Lastly, the Oracle relational database is introduced.

2.1.1 Tables and Keys

Most relational databases provide a database management system that contains a variety of data structures and fundamental properties. These structures and properties allow the RDBMS to create database abstractions for many types of systems.

Tables

Tables are the primary data storage structure of relational databases, where all data about a certain kind of entity are kept. A table consists of rows and columns. The table and column names are used to interpret the meaning of the data in each row. A simple example is a table called PERSON, where a PERSON entity (i.e. a row in the table) has a name, age and occupation (see Figure 2.1). These attributes are the column names of the PERSON table. When a PERSON entity is added to the table, the data values are stored under the corresponding columns to represent that specific PERSON entity.


Figure 2.1: A PERSON table with the attributes name, age and occupation.

Keys

A key consists of one or more attributes that serve as a unique identifier of a row in a table [8]. Keys can come in different forms, but the most common ones are primary keys and foreign keys.

A primary key is defined as a set of columns of a table that uniquely identifies a row in that table. With primary keys, the database ensures that the rows have unique, non-duplicated and non-NULL values in those columns. A foreign key is defined as a column in a table that references a primary key in another table. Foreign keys, in contrast to primary keys, accept NULL values and duplicate values. For example, see Figure 2.2, where we have a table PERSON and a table OCCUPATION. In the PERSON table we have a person called Johan whose occupation is the number 2. Here, the number 2 is a foreign key referencing the ID value 2, which is a primary key in the OCCUPATION table.


Figure 2.2: A representation of a primary-foreign key relationship.
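The key rules described above can be tried out with a minimal Python sketch that uses SQLite as a stand-in for a relational database. Only the person Johan and the occupation ID 2 come from the example above; every other name, age and occupation title is invented for illustration, and the helper `is_rejected` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE occupation (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE person (
        name TEXT,
        age INTEGER,
        occupation INTEGER REFERENCES occupation(id)
    );
    INSERT INTO occupation VALUES (1, 'Engineer'), (2, 'Teacher');
""")

# A foreign key may repeat a value and may be NULL.
conn.execute("INSERT INTO person VALUES ('Johan', 30, 2)")
conn.execute("INSERT INTO person VALUES ('Maria', 41, 2)")
conn.execute("INSERT INTO person VALUES ('Erik', 25, NULL)")


def is_rejected(sql: str) -> bool:
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# A primary key value must be unique.
duplicate_pk = is_rejected("INSERT INTO occupation VALUES (2, 'Electrician')")
# A foreign key value must reference an existing primary key.
dangling_fk = is_rejected("INSERT INTO person VALUES ('Lisa', 52, 99)")

person_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(duplicate_pk, dangling_fk, person_count)
```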

2.1.2 Index

Indexes are used to speed up searching for data in a table, known as a look-up, in a database. They can be used to efficiently find all matching rows given some attribute in a search query and then search through only a subset of the table to find exact matches. If a table does not have an index on any column, the database has to search through the entire table and check every row to find matches, which is a slow operation on larger tables [8][13].

Clustered Index

A clustered index can be applied to a column in a table, and the rows in the table are then sorted by that column. An example is the table of contents in this paper, where the numbering of the chapters and sections serves as the clustered index. Clustered indexes are suitable for retrieving large amounts of data in range-based queries, since all data are located next to each other. A clustered index also reflects how data is stored in memory, which limits the database to one clustered index per table [8][13].

Non-Clustered Index

A non-clustered index is different from a clustered index, in that a table can have many non-clustered indexes pointing to data cells in the table [8][13]. The index pages at the end of a book are an example of how non-clustered indexes work; they point to the places where a sought word is. For instance, when searching for the word "Mom", the non-clustered index points to the places in the table where the word "Mom" is stored. This is convenient for larger tables in order to retrieve specific information that would otherwise require searching the entire table for matches, thus reducing the execution time.
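The effect of an index on a look-up can be observed by asking the database for its query plan before and after the index is created. The sketch below is a minimal Python example that uses SQLite rather than Oracle; the table contents, the index name `idx_person_name` and the helper `plan` are assumptions made for illustration. The exact wording of the plan text differs between SQLite versions, so no expected output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER, occupation TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(f"name{i}", 20 + i % 50, "engineer") for i in range(1000)],
)


def plan(sql: str) -> str:
    """Return the plan description that SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


query = "SELECT * FROM person WHERE name = 'name500'"

before = plan(query)  # without an index: the whole table is scanned
conn.execute("CREATE INDEX idx_person_name ON person (name)")
after = plan(query)   # with an index: only matching entries are searched

print(before)
print(after)
```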

2.1.3 Stored Procedure

A stored procedure is a set of pre-compiled statements used to perform a particular task. These procedures are parsed and optimized before the first execution, and compiled versions of the stored procedures remain in memory for later use. This means that a stored procedure does not need to be re-parsed or re-optimized for each execution, which results in shorter execution times.

A benefit of stored procedures is that the developer of a procedure can centralize data-access logic into a single place. That is, the procedure itself handles the relevant data access internally and executions of the procedure only need to be concerned with the output. With stored procedures the developer can specify access permissions to specific procedures, adding another level of security. This means that a user does not need to have read/write permissions on the underlying tables of the procedures. Furthermore, they allow for more structure and organization in larger SQL segments by replacing repeated SQL segments with stored procedures.

2.1.4 Query Processing

In the Oracle relational database used in this study, query processing has four steps. As described by Elmasri [8] and Oracle [17], these steps are:

1. Parsing

2. Optimization

3. Query plan generation

4. Execution


The way queries are processed affects the execution time. The following paragraphs explain the steps in more detail and allow for a deeper understanding of the execution plan analysis in Chapter 6.

Parsing

The parsing phase of the query processing makes sure that the query is valid before going to the next phase. During the parsing phase, the query is passed through three different checks:

1. Syntax check: looks at the query and checks that it is syntactically correct according to SQL grammar.

2. Semantic check: looks at references to the database objects and checks that host variables have correct data types and are valid.

3. Shared pool check: looks at the query and checks whether some resource-intensive steps of the query processing are needed. The check consists of looking through the shared pool (i.e. a shared memory space) and trying to find previously parsed statements. If the check determines that a statement in the shared pool exists, then the database performs semantic checks to determine whether the statements are semantically equivalent.

Optimization

The optimization phase of the query processing generates multiple execution plans and then chooses the one with the lowest cost for the next phase. The optimizer uses collected statistics to calculate a cost for each subquery execution plan. SQL is a non-procedural language, which means that the optimizer can reorganize the subqueries in any order. The cost computation takes into account factors of query execution such as I/O, CPU and communication. The resulting cost is an internal unit that the optimizer uses for plan comparisons.

The optimizer determines the optimal execution plan for a subquery by examining multiple access methods, such as full table scans, index scans and different JOIN operations. After calculating the total cost for each execution plan, the optimizer chooses the plan with the lowest cost estimate. The selected execution plan is a recommendation of an execution method for the SQL statement. However, the execution plan is not usable by the database, but solely by the optimizer. The next step follows this up by formulating the execution plan as a query plan that is usable by the database.

Query Plan Generation

The query plan generation phase formulates a query plan that is usable for the database. The query plan itself takes the form of an iterative plan with several steps. Each step returns a row source, which is either used in the next step or returned to the application that initiated the SQL statement. A row source is a set of rows returned by a step in the query plan along with a control structure that helps process the rows.
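The idea of steps that each return a row source to the next step can be sketched with Python generators. This is only an analogy under stated assumptions and not a description of Oracle's internal implementation; the step names `table_scan`, `filter_rows` and `project`, as well as the sample rows, are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def table_scan(table: Iterable[Row]) -> Iterator[Row]:
    """Leaf step: produce every row of a table."""
    for row in table:
        yield row


def filter_rows(source: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Intermediate step: pass on only the rows that satisfy the predicate."""
    for row in source:
        if predicate(row):
            yield row


def project(source: Iterator[Row], columns: List[str]) -> Iterator[Row]:
    """Final step: keep only the requested columns of each row."""
    for row in source:
        yield {column: row[column] for column in columns}


person = [
    {"name": "Johan", "age": 30, "occupation": 2},
    {"name": "Maria", "age": 41, "occupation": 2},
    {"name": "Erik", "age": 25, "occupation": 1},
]

# Each step consumes the row source produced by the step below it.
query_plan = project(
    filter_rows(table_scan(person), lambda row: row["age"] > 28),
    ["name"],
)
result = list(query_plan)
print(result)
```

Because generators are lazy, no row is read until the outermost step asks for it, which mirrors the way rows flow upwards through the steps of a plan.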

Execution

The execution phase uses the query plan generated in the query plan generation phase. During the execution phase, the SQL engine goes through the query plan and executes each row source. If the data are not stored in memory, the database reads the data from disk into memory and then executes. The database makes sure to preserve integrity by releasing internal locks and logging any changes made during the query execution.

2.1.5 The Oracle Relational Database

The Oracle relational database is an RDBMS from the Oracle Corporation. It has been in development for the last 35 years. The first commercially available version, called Oracle version 2, was introduced in 1979 and was the first SQL-based RDBMS. The system is built on a relational database framework where data objects can be accessed directly by users (or an application) through SQL statements. Today, the Oracle relational database is one of the most used RDBMSs and is often used by global enterprises [17].


2.2 Graph Databases

Graph databases come from the NoSQL family of database systems and have their foundation in graph theory [18]. In a graph database, an object or system is represented by a mathematical graph. Figure 2.3 is an example of a graph sketch of a person, where the person has a name, age and occupation. The elements used to create a graph-like abstraction are further explained in Section 2.2.1.

Figure 2.3: A sketch of a graph representing a person and some of its attributes.

Accessing data that reside in a graph database can be done using different languages. Since graph databases are quite new in comparison to relational databases, there has not been any widespread standardization regarding a graph query language [1]. However, SPARQL, Gremlin, GraphQL and Cypher are graph query languages that many graph databases have adopted, with Cypher being the most popular one due to its resemblance to SQL [12]. Cypher is a declarative graph query language, created by Neo4j Inc., which allows for expressive and efficient creation of graphs as well as manipulation of graph data. Cypher’s code aesthetics resemble ASCII art and SQL, making it relatively easy to formulate bidirectional queries for complicated systems [20].

The following subsections cover graphs in detail, followed by index-free adjacency and graph query processing. Lastly, the Neo4j graph database is introduced.


2.2.1 Graphs

Graphs are created with two core data structures: the nodes, where the data can be stored, and the relationships, which represent the connections between the nodes. The way the data is stored in each node depends on the graph database itself. A node can represent either an entity or an attribute depending on the key-value mappings within it. The data that reside within a node are schema-less, meaning each individual node can have its own number of data mappings. If a node contains one key-value mapping it could represent an attribute such as NAME in Figure 2.3. However, if a node contains several key-value mappings it could represent an entity such as PERSON in Figure 2.4. We can also assign labels to nodes to help classify them; labels are used in a similar fashion to table names in that they describe which collection a node belongs to. For example, Figure 2.5 shows two nodes having the same label (PERSON), meaning they are classified as the same type of node [18][9].

Figure 2.4: A node representing a PERSON.

The other core data structure for forming graphs is the relationship. As mentioned previously, the purpose of a relationship is to represent a connection between nodes. A relationship can also be classified by a label in a similar way to a node. Figure 2.5 shows several nodes labeled PERSON and COMPANY and their corresponding data; these nodes are connected via differently labeled relationships (:Friends, :WorksAt). In addition, a relationship can also hold data, mapped as key-value pairs [18][9].


Figure 2.5: An illustration of relationships between nodes.
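The two core data structures described above can be sketched as plain Python classes. This is an illustrative in-memory model and not Neo4j's storage format; the class names, the helper `connect` and all property values are assumptions, while the labels PERSON, COMPANY, Friends and WorksAt follow Figure 2.5.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)  # compare by identity; the structure is cyclic
class Node:
    label: str
    properties: Dict[str, object] = field(default_factory=dict)
    relationships: List["Relationship"] = field(default_factory=list)


@dataclass(eq=False)
class Relationship:
    label: str
    start: Node
    end: Node
    properties: Dict[str, object] = field(default_factory=dict)


def connect(start: Node, label: str, end: Node, **properties: object) -> Relationship:
    """Create a labeled relationship and register it on both of its nodes."""
    relationship = Relationship(label, start, end, dict(properties))
    start.relationships.append(relationship)
    end.relationships.append(relationship)
    return relationship


# Schema-less: each node carries its own set of key-value mappings.
johan = Node("PERSON", {"name": "Johan", "age": 30})
maria = Node("PERSON", {"name": "Maria"})
company = Node("COMPANY", {"name": "ExampleCorp"})

connect(johan, "Friends", maria, since=2015)
connect(johan, "WorksAt", company)

labels_of_johan = [relationship.label for relationship in johan.relationships]
print(labels_of_johan)
```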

2.2.2 Index-Free Adjacency

There are two main families of graph databases, native and non-native, and the underlying storage mechanism varies between them. Some non-native graph databases depend on relational storage and store the graph data in relational tables, imposing another level of abstraction between the graph database, its management system and the physical devices where data are stored. Others use a key-value or document-oriented database for storage, making them inherently NoSQL databases.

Native graph databases, in contrast, are graph databases that are specifically built for graph handling [18]. Their storage, memory management and query engine all support index-free adjacency. With index-free adjacency, each node has a direct reference to its adjacent nodes. This means that a look-up of an adjacent node has constant time complexity O(1), as opposed to a (global) index look-up, which typically has time complexity O(log(n)), where n is the number of nodes in the graph [18][9]. With the property of index-free adjacency, traversing a graph operates like pointer-chasing [22].
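The difference between the two look-up styles can be sketched in Python. The example below is a simplified in-memory analogy and not the storage engine of any particular database: `trace_index_free` follows direct object references, while `trace_with_index` resolves every hop through a binary search in a sorted global index. All names and the small example graph are assumptions made for illustration.

```python
import bisect
from collections import deque
from typing import List


class Node:
    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.neighbours: List["Node"] = []  # direct references to adjacent nodes


def trace_index_free(start: Node) -> List[int]:
    """Breadth-first trace that reaches each neighbour by reference, O(1) per hop."""
    seen = {start.node_id}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node.node_id)
        for neighbour in node.neighbours:  # pointer-chasing, no index involved
            if neighbour.node_id not in seen:
                seen.add(neighbour.node_id)
                queue.append(neighbour)
    return order


def trace_with_index(start_id: int, sorted_ids: List[int],
                     neighbour_ids: List[List[int]]) -> List[int]:
    """Breadth-first trace that locates each node through a sorted global index."""
    seen = {start_id}
    queue = deque([start_id])
    order = []
    while queue:
        node_id = queue.popleft()
        order.append(node_id)
        position = bisect.bisect_left(sorted_ids, node_id)  # O(log n) per hop
        for neighbour_id in neighbour_ids[position]:
            if neighbour_id not in seen:
                seen.add(neighbour_id)
                queue.append(neighbour_id)
    return order


# A small undirected example network with five nodes.
edges = [(1, 2), (1, 3), (2, 4), (3, 5)]
nodes = {i: Node(i) for i in range(1, 6)}
for a, b in edges:
    nodes[a].neighbours.append(nodes[b])
    nodes[b].neighbours.append(nodes[a])

sorted_ids = sorted(nodes)
neighbour_ids = [[n.node_id for n in nodes[i].neighbours] for i in sorted_ids]

by_reference = trace_index_free(nodes[1])
by_index = trace_with_index(1, sorted_ids, neighbour_ids)
print(by_reference, by_index)
```

Both traces visit the same nodes in the same order; the difference lies only in the cost of each hop, which grows with the size of the index in the second variant.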


Figure 2.6: An illustration of a graph database with relational storage and a native graph database with graph storage.

2.2.3 Query Processing

Queries are processed differently in different graph databases. In the Neo4j graph database used in this study, the processing of a query until it is ready for execution has several steps [18][9][20]:

1. Parsing

2. Query graph generation

3. Logical plan generation

4. Optimization

5. Execution plan generation

6. Execution

These steps may themselves contain several underlying steps. The steps in Neo4j’s query processing have similarities to the query processing in Oracle; both have a general pattern of parsing, optimization and execution. However, because Oracle uses SQL and Neo4j uses Cypher, the similarities stop there. The following paragraphs explain the steps in more detail and allow for a deeper understanding of the execution plan analysis in Chapter 6.


Parsing

The parsing phase of the query processing has an end goal of creating a normalized Abstract Syntax Tree (AST). For that to happen, the query needs to be syntactically validated, tokenized and parsed into an AST. Then, using the AST, a semantic check is performed to validate types, scoping and binding of variables.

At this point, the AST is both syntactically and semantically validated. Now the parser only needs to normalize the AST. This is done by rewriting the AST in different ways. Some of these revisions could be label reorganization, type reorganization, redundancy suppression or alias expansion.

Query Graph Generation

The query graph generation phase uses the normalized AST to generate a query graph, which is a more high-level abstraction of the query. This query graph allows for more high-level optimization in further steps.

Logical Plan Generation

The logical plan generation uses the query graph to generate logical plans. A logical plan is a plan that is used internally to determine which operations to use in future steps. A query graph can contain several subquery graphs; a logical plan is produced for each of the subgraphs in a step-by-step, bottom-up fashion.

At each step, using information from the query graph, the phase estimates the number of matching nodes using previously collected statistics and then uses these to estimate the cost of building a preferred logical plan. The cost of a logical plan is a statistical estimation of how much work the database needs to do. Usually, this cost mainly accounts for I/O reads from storage, in-memory computations and communications.

Building the logical plan bottom-up is done using Iterative Dynamic Programming (IDP). It makes sure that each individual step is the most cost-effective [16]. However, the cost-effectiveness is built on estimates and is therefore subject to errors. These can occur in at least three different ways:

• Statistical inputs can be incorrect because sampling is not perfect.

• When combining costs, the phase does not know how the costs correlate, which can produce wrong estimates.

• If the search space grows too much between steps, the phase will prune parts of the logical plan, which may lead to some plans never being evaluated.

At the end of the logical plan generation phase, the preferred cost-effective logical plan is selected for the next step, the optimization step. The selection of the most effective plan is made using IDP as described above. However, the selected plan is not guaranteed to be the optimal plan.

Optimization

During the optimization phase, the logical plan is improved as much as possible. The process consists of running several improvement checks on the plan and applying suitable operations according to the results of these checks. Examples of such operations are unnesting and merging.

Execution Plan Generation

The execution plan generation phase uses the optimized logical plan and generates an execution plan. This is done by constructing an execution tree which consists of internal operations, where each non-leaf node gets information from one or two child nodes. The information that is sent between operators in the intermediate states of the tree consists of the attributes needed and their types.

Execution

The execution phase uses the execution plan tree and executes each operation in a pipeline until the final result is obtained.

2.2.4 The Neo4j Graph Database

The Neo4j graph database is an open-source native graph database written in Java and developed by Neo4j Inc. It has been on the market since 2007 and is currently the world’s most used graph database [20].


2.3 Database Benchmarking

A database benchmark (DBB) is a set of tasks sent to databases to evaluate their relative performances in a controlled manner. However, creating a good and reliable DBB is not an easy task. For a DBB to be reliable, some factors need to be considered. Huppler [11] points out that five characteristics make a general benchmark good:

• Relevant – A reader of the result believes the benchmark reflects something important.

• Repeatable – There is confidence that the benchmark can be run a second time with the same result.

• Fair – All systems or software being compared can participate equally.

• Verifiable – There is confidence that the documented result is correct.

• Economical – The test sponsors can afford to run the benchmark.

According to Huppler [11], creating a perfect benchmark with all of these characteristics is almost impossible. Thus, benchmark developers need to abandon some of the characteristics and focus on the others.
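One way to gain confidence in the repeatability characteristic is to run the same task several times and report the spread of the measurements together with the mean. The following is a minimal Python sketch under stated assumptions: the function `benchmark`, the stand-in task `workload` and the definition of throughput as the inverse of the mean execution time are illustrative choices and are not taken from the benchmark framework of this thesis.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_query: Callable[[], object], repetitions: int = 10) -> Dict[str, float]:
    """Run the same task repeatedly and summarize the measured times."""
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    mean = statistics.mean(times)
    return {
        "mean_s": mean,
        "stdev_s": statistics.stdev(times),  # sample standard deviation
        "throughput_qps": 1.0 / mean,        # assumed definition: queries per second
    }


def workload() -> int:
    """Stand-in for sending one query to a database (assumption)."""
    return sum(range(100_000))


summary = benchmark(workload, repetitions=5)
print(summary)
```

A small standard deviation relative to the mean suggests that a second run of the benchmark would give a similar result.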

There are plenty of DBBs for relational databases. The most common standards for RDBMSs are the TPC-C, TPC-H and TPC-E benchmarks [7], but there are open-source alternatives too. All of these focus on relational operations such as JOINs, projections, selections, aggregations and sorting. However, these types of DBBs do not work for graph databases since graphs operate under a different principle.

Another type of DBB is the OO7 benchmark by Carey, DeWitt, and Naughton [6]. This benchmark is for object-oriented databases (OODB), which have similarities to graph databases. Data stored in an OODB can follow a graph-like structure, where the entities have relationships among themselves. The OO7 benchmark contains a set of queries that are categorized into two groups: traversal queries and general queries. Even though OODB benchmarks can create graphs, their structures are not like graphs in a graph database or graph analysis. That is because OODB relationships are more like references in a relational database than pointers in a graph database.

Currently, there are only a handful of graph-oriented DBBs. One is the HPC Scalable Graph Analysis Benchmark [3]. This DBB contains four different categories of queries:

• Insert the graph database as a bulk load.

• Retrieve the set of edges with maximum weight.

• Perform a k-hops operation.

• Calculate the centrality of a graph, where the performance is measured in edges traversed per second.
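To make the k-hops category in the list above concrete, here is a minimal Python sketch of a k-hops operation as a depth-limited breadth-first search. The adjacency-list representation and the toy graph are assumptions for illustration; the benchmark itself does not prescribe an implementation.

```python
from collections import deque
from typing import Dict, List, Set

def k_hops(adjacency: Dict[int, List[int]], start: int, k: int) -> Set[int]:
    """Return all nodes reachable from start in at most k edges (start excluded)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen

# Hypothetical toy graph: a simple chain 1 - 2 - 3 - 4.
toy_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
two_hops = k_hops(toy_graph, 1, 2)
```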

In this project, the comparison is between two different types of databases (relational and graph) and at this moment there is no DBB that operates across paradigms. Thus, further details regarding the implementation of the benchmark for this project and how it reflects the characteristics that Huppler [11] pointed out will be presented in Chapter 4.

2.4 Database Modeling

When benchmarking an abstraction of a system, many factors contribute to its performance and the modeling of the abstraction is one of them. Creating a model that correctly and efficiently represents a system is crucial to the performance and is therefore essential to this project.

Creating a model of a system for a database can be done in different ways. Depending on the underlying data structure of the database, the procedures to develop a model that works efficiently for that specific database may be tedious and difficult. Generally, for both graph databases and relational databases, the starting point when modeling a system is to understand the domain of the system. From the understanding of the domain, entities and properties can be defined by how they interrelate and what rules govern the domain. Most of this tends to be quite informal and is often done as paper sketches. However, from here, the modeling characteristics diverge. Figure 2.7 shows an example of a domain model representing a small electrical circuit. There you can see several components that are connected together according to the domain model. For example, several fuses can be connected to one bus bar and one cable can be connected to one fuse.

Figure 2.7: An example of a small electrical circuit domain model.

2.4.1 Relational Modeling

In the case of relational databases, the process of creating a model given a domain can be done in different ways. One way to model a system given an understanding of the domain is to first create a diagram of the system. This diagram shows all the identified entities and how they are connected, similar to Figure 2.7. The next step is to create a logical model, called an Entity-Relationship diagram (E-R diagram). This is a more comprehensive graph with more information about the entities. This information can be anything related to the entity, usually relevant properties that define the entity.

Having an E-R diagram with all needed information, the developer can map it into tables and normalize them to reduce redundancy. Normalizing is a technique to reduce redundant data in tables by splitting them into intermediate tables and creating references between the intermediate tables.

At this point, the tables are normalized and roughly correspond to the domain, and the process of creating a model can be seen as complete. However, one of the problems of normalized models is that they are typically not fast enough for real-world applications; to satisfy real-world requirements, they need to be changed to suit the database engine through denormalization.

Denormalization involves finding out what redundant data to introduce in order to reduce the complexity of a query and thereby improve efficiency. An example of this could be to introduce redundant data to reduce the number of JOIN operations needed in a query. However, when a database contains many tables and queries contain many JOIN operations, denormalization can be difficult. Also, given the total lifespan of a model, this may be difficult to perfect since changes to the model may happen not only during development but also during its use.
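The trade-off can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a relational database; the table and column names (component, component_type, component_denorm) are hypothetical and chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized model: the type name lives in its own table.
    CREATE TABLE component_type (type_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE component (id INTEGER PRIMARY KEY, type_id INTEGER);
    INSERT INTO component_type VALUES (1, 'cable'), (2, 'fuse');
    INSERT INTO component VALUES (10, 1), (11, 2), (12, 1);

    -- Denormalized model: the type name is stored redundantly on each row.
    CREATE TABLE component_denorm (id INTEGER PRIMARY KEY, type_name TEXT);
    INSERT INTO component_denorm VALUES (10, 'cable'), (11, 'fuse'), (12, 'cable');
""")

# The normalized query needs a JOIN to resolve the type name.
normalized = conn.execute(
    "SELECT c.id FROM component c "
    "JOIN component_type t ON c.type_id = t.type_id "
    "WHERE t.name = 'cable' ORDER BY c.id"
).fetchall()

# The denormalized query answers the same question from a single table.
denormalized = conn.execute(
    "SELECT id FROM component_denorm WHERE type_name = 'cable' ORDER BY id"
).fetchall()
```

Both queries return the same rows; the denormalized one avoids the JOIN at the cost of repeating the type name on every row and keeping it consistent on updates.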

2.4.2 Graph Modeling

In the case of native graph databases, the process of creating a graph model given an understanding of the domain is more straightforward. The first step is to create a graph of the domain, similar to the first diagram when creating a relational model, and then include all the information of each entity/node to create an accurate representation of the domain. This means each node captures its appropriate label, properties and relationships to other nodes.

Domain modeling is usually identical to graph modeling; by representing the domain model as accurately as possible, the graph model becomes equally accurate. An advantage of modeling in this manner is that the semantic context remains intact. Each node and relationship still has its domain representation, and there is no need for normalization or denormalization to make the model efficient.

At this point, the graph model is done and only needs to be transcribed into the graph database. Webber [22] summarizes this manner of modeling a system for a graph database as "What you draw, is what you store", meaning that how you draw it on a whiteboard or paper is how it will be stored and represented in the graph database.

2.5 Query Execution Time Estimation

When benchmarking execution time, there is a need to understand which steps within the query process take time. However, execution time is not simple to estimate. There are several factors to take into account when estimating the execution time. It primarily depends on the setup of the environment itself. A poorly designed environment consists of components set up in such a way that they negatively affect the execution time. A well-designed environment will instead minimize the access time, storage time, computation time and communication time.

All factors take into account different operations and the total time estimation is a combination of them. For example, a distributed environment can have a more considerable communication time than a standard local environment (i.e. everything is run on the same computer). A standard local setup typically has a more sizable access time than a local full RAM environment (i.e. the whole database is in primary memory after one read). The following sections explain these different time factors and what they take into account.

2.5.1 Access Time

Access time is the time it takes for the CPU to get the sought-after data from secondary storage to primary storage for further processing. Today, this process is quite fast but depends on what type of secondary storage device the database uses.

Hard Disk Drives (HDD) and Solid State Drives (SSD) are the standard types of secondary storage devices. An SSD does functionally everything an HDD does, but data are stored on flash memory chips, whereas an HDD stores data on magnetic discs. The main advantage that an SSD has over an HDD is that it is faster. An SSD can perform more I/O operations per second, which makes SSDs more efficient than HDDs and contributes a lot to the decrease of access time. Generally, for databases where access time is essential, using an SSD is recommended.

2.5.2 Storage Time

Storage time is the time it takes for the CPU to access the data from primary memory. Since primary memory is the fastest storage medium, RDBMSs can utilize whole systems with large amounts of primary memory to fit the entire database. This allows for high-speed performance and the CPU only needs to access data from secondary storage once.

2.5.3 Computation Time

Computation time is the time it takes for the query engine to compute a query and produce the result. The majority of the computation time comes from the query processing phase (i.e. parsing, optimization and execution). This process happens within primary memory, which makes this quite fast. However, smaller databases that do not have enough primary memory space get affected by access time drawbacks and may need to optimize their database structure to decrease the computation time in order to increase overall performance.

Modern relational query engines have had a lot of time to improve the underlying algorithms performing the query processing. Today, a look-up usually has a time complexity of O(log(n)), where n is the number of rows in the concerned table. However, a modern native graph query engine can have a look-up time complexity of O(1) [18]. This makes queries that would be JOIN-heavy in a relational database quite fast to compute in a graph database; see section 2.2.2 for further explanation.
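The difference between the two look-up costs can be sketched in Python. The sorted list below stands in for a table index and the Node class for a graph node with direct neighbor references; both are simplified assumptions, not the actual data structures of Oracle or Neo4j.

```python
import bisect

# Hypothetical sorted index over component ids, standing in for a table index.
index_keys = list(range(0, 1_000_000, 2))

def index_lookup(key):
    """Binary search: O(log n) comparisons to find the position of key."""
    pos = bisect.bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        return pos
    return None

class Node:
    """Node holding direct references to its neighbors (index-free adjacency)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbors = []

a, b = Node(1), Node(2)
a.neighbors.append(b)

# Following a stored reference costs O(1), regardless of how many nodes exist.
first_neighbor = a.neighbors[0]
```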

2.5.4 Communication Time

Communication time is the time it takes to send data between environment points. This can be minimal for a local system. However, for a distributed system, this can affect the total time when data need to be transported from a cluster of different databases and used to compute a result.

2.6 Database Storage

When benchmarking a system, there is a need to understand the machine that it is performed on. Depending on the storage medium, the performance of the benchmark may vary. The collection of data that makes up a database needs to be stored physically in this storage medium. The RDBMS can then retrieve, update and process these data when needed. The storage medium forms a storage hierarchy that includes two main categories:


• Cache and main memory are classified as primary storage.

• HDD and SSD, as mentioned before, are classified as secondary storage.

Primary storage provides fast access to data but has limited storage capacity. Its capacity has been growing rapidly in recent years. Nevertheless, primary storage devices are still more expensive and have less storage capacity than secondary storage devices.

Secondary storage devices have a more extensive storage capacity but provide slower access to data than primary storage devices, and data cannot be processed directly by the CPU. The CPU first needs to copy data into primary storage in order to access it for processing.

Programs reside and execute in main memory. Generally, large databases reside on secondary storage, usually HDD or SSD, and parts of the database are read and written into main memory when needed. Typically, database applications require only a small part of the database when in use. Whenever specific data that is not in main memory is required, it first needs to be located in secondary storage, copied to primary memory for processing and then written back to secondary storage if the data has been modified.

There are valid reasons for storing the database on secondary storage media. Apart from the cost of storage per unit of data, there are other reasons to utilize secondary storage for the entire database. One reason is that databases today can be as large as hundreds of terabytes, which in most cases is too large to fit in primary memory. Another reason is that the circumstances causing permanent loss of stored data arise less frequently for secondary storage than for primary storage.

Data stored in secondary storage is organized as files of records. Each record, which can be seen as a row in the database, is a collection of data values that can be interpreted as facts about entities, their attributes and their relationships. Records are stored in the secondary storage in a manner that makes locating them as efficient as possible for the database.

Chapter 3

Related Research

This chapter provides insight into research related to this thesis.

3.1 Benchmarking Database Systems for Social Network Applications

Angles et al. [2] perform a benchmark on five different databases. The databases used are two graph databases (Dex and Neo4j), one RDF database (RDF-3X) and two relational databases (Virtuoso and PostgreSQL). The purpose of the paper is to investigate how different database paradigms handle graph workload and, in this particular case, social network workload.

The focus of the paper is on performance and the specific aspects are:

• Data loading time - The time it takes to load the data from a source file.

• Query execution time - The time it takes for the database to execute the query.

• Data indexes - The time it takes to create indexes.

However, the performance aspect of query execution time is the central focus of the paper and is also the most relevant aspect for this project.

The results presented in the paper are based on several queries that the microbenchmark performs and all of them show that graph databases perform better than relational and RDF databases when the graph increases in size. In the paper, the graph size increases from 1 million to 10 million nodes. The significant differences between relational and graph databases appear when the graph size is large.

The paper concludes that graph databases have shorter query execution times in comparison to relational databases when the graph size increases to the millions. However, this paper was published in 2013 and most of the databases, especially Neo4j, have been updated to increase their performance, meaning the results may be different now. Nevertheless, the conclusion should still be valid, that graph databases perform better than relational databases when it comes to social-network related queries and interconnected networks. This can be seen with the increased interest in graph databases and the wide use of Neo4j [22].

3.2 The Shortest Path Algorithm Performance Comparison in Graph and Relational Database on a Transportation Network

Miler, Medak, and Odobasic [15] conduct a performance comparison of Dijkstra’s Shortest Path Algorithm (DSPA) between a relational database (PostgreSQL) and a graph database (Neo4j). In this paper, they model a transportation network and execute DSPA several times with different settings. They experiment with different numbers of threads and different configurations of Neo4j to see which database operates the fastest.

The results show that Neo4j performs best with the recommended settings. Additionally, although Neo4j performed better than PostgreSQL, the memory required was 20-75% more, depending on the number of threads used in the computation. The authors’ initial assumption was that Neo4j must perform better since the transportation network is stored in the database.

The paper concludes that Neo4j is 30-35% faster than PostgreSQL. However, there is an increased cost of memory with Neo4j. If memory is not an issue, then Neo4j is the choice for DSPA, according to the authors. The paper provides some more in-depth insight into Neo4j and the memory management when performing more advanced computing. Also, although PostgreSQL is not Oracle, it is still relevant when comparing relational databases and graph databases.

3.3 Relational Database and Graph Database: A Comparative Analysis

Medhi and Baruah [14] perform a benchmark on a relational database (MySQL) and a graph database (Neo4j). The purpose is to compare their performances when handling connected data, where the performance metric is execution time.

The benchmark presented in the paper is based on two predefined queries on a graph schema representing the cricket sport, where the entities and relationships are as follows:

• Entities

– Player

– Team

– Game

• Relationships

– Player to Team

– Player to Game

The predefined queries used by the authors are simple retrieve queries that follow the structure of:

• Find attribute/property X from entity Y.

• Find attribute/property X from entity Y, that has relationships Z.

The results show that Neo4j outperforms MySQL for all queries, sometimes by a factor of 20. The authors performed the queries on graphs of sizes 100, 300 and 400 nodes. With connected data, Neo4j outperformed MySQL regarding execution time.


3.4 Comparative Analysis of Relational and Graph Databases

Batra and Tyagi [4] perform a benchmark similar to the one in the paper by Medhi and Baruah [14]. The databases are also the same, namely MySQL and Neo4j. As with the previous study, the purpose of the paper is to compare the performance of the two databases with connected data, where the performance metric is execution time.

The benchmark presented in the paper is based on three predefined queries on a graph schema representing relationships between users, movie interests and movie actors. Here the entities and relationships are:

• Entities

– User

– Movie

– Actor

• Relationships

– User to User

– User to Movie

– Movie to Actor

The graph schema used in this paper is a simple undirected graph. A difference in this study, compared to the previous one, is that the graph seems to be more focused on reachability, rather than retrieval.

The predefined queries are simple reachability queries that follow the structure of:

• Find all neighbors of X.

• Find all neighbors of X with attribute/property Y.

• Find all neighbors of Y with attribute/property Z and a relationship with X.

The presented results show that Neo4j outperforms MySQL for all queries, sometimes by a factor of 20. The three queries were run on graphs of 100 and 500 nodes. At first glance, the graph sizes seem too small, but according to the authors, the differences in execution time were already sufficiently significant and larger graph sizes would not give any more insightful results.

The paper and the authors’ conclusions were clear. Like in the previous study, Neo4j outperformed MySQL regarding execution time in the context of connected data.

Chapter 4

Methodology

This chapter describes what data were used and how the electrical network model, the experiment environment and the benchmark were set up and implemented.

4.1 Datasets

There are three datasets used in this project, which are representations of three different electrical networks. The first dataset, the 8K-set, is comprised of 8000 nodes and is here used for testing and benchmarking of small datasets. The two other datasets, the 220K-set and 6.3M-set, are more extensive with 220 000 and 6.3 million nodes respectively and are here used for benchmarking of extensive datasets.

The electrical networks represented by the datasets are logically connected such that they depict real-world networks in Sweden. Each dataset consists of a table that is comprised of rows that represent electrical components occurring in the network and columns describing the properties of a component. For this work, relevant component properties are id, type and connection points (c1 and c2, depicting each connector end of a component). Other properties are redundant for this work, but are still preserved in order to keep the same amount of data in each database.

Each electrical component can be categorized into one of two main categories: one connection component (OCC) or two connections component (TCC). OCC’s can be described as components that have only one connection point (i.e. c1 and c2 have the same value) and can therefore not be logically chained. Examples of OCC’s are bus bars, bays or delivery points. TCC’s instead represent components with two connection points (i.e. c1 and c2 have different values) that can be logically chained. Examples of TCC’s are cables, fuses and circuit breakers.
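The categorization rule can be expressed in a few lines of Python. The Component record below is a hypothetical stand-in for a dataset row and keeps only the properties named in the text (id, type, c1 and c2); the sample values are made up.

```python
from typing import NamedTuple

class Component(NamedTuple):
    """Hypothetical row of a dataset, reduced to the relevant properties."""
    id: int
    type: str
    c1: int
    c2: int

def category(component: Component) -> str:
    """An OCC has identical connection points; a TCC has two different ones."""
    return "OCC" if component.c1 == component.c2 else "TCC"

bus_bar = Component(id=1, type="bus bar", c1=100, c2=100)
cable = Component(id=2, type="cable", c1=100, c2=200)
```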

The electrical networks that the three datasets represent can be depicted as graphs where each component is coupled, end to end, with another component, similarly to the domain model in Figure 2.7.

4.2 Modeling

The focus of the model that is stored in both the Oracle RDBMS and the Neo4j graph database is on how the components are connected. For the Oracle relational database, a connection table (conntab) was created with every component from a dataset for each of the three datasets described in section 4.1. Since each component is represented as a table row holding its properties, the approach to traverse the network was to iteratively perform look-ups in conntab for the next connections. Figure 4.1 shows an example of two cables connected via a bus bar in the middle and the circuit’s relationship to the conntab data. Notice that all the components are in the same table and distinguishing between the OCC and TCC’s is done by looking at their connection points. Here, the bus bar has the same value for c1 and c2, while the cables have different values for c1 and c2, indicating that the bus bar is an OCC and the cables are TCC’s.
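A single look-up step of this kind can be sketched as follows. SQLite (via Python's sqlite3 module) stands in for Oracle here, and the column layout, indexes and sample rows are assumptions made for the example rather than the actual conntab schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conntab (id INTEGER PRIMARY KEY, type TEXT, c1 INTEGER, c2 INTEGER)"
)
# Two cables joined by a bus bar at connection point 20 (hypothetical values).
conn.executemany("INSERT INTO conntab VALUES (?, ?, ?, ?)", [
    (1, "cable", 10, 20),
    (2, "bus bar", 20, 20),
    (3, "cable", 20, 30),
])
# Indexes on the connection points keep each look-up logarithmic.
conn.execute("CREATE INDEX conntab_c1 ON conntab (c1)")
conn.execute("CREATE INDEX conntab_c2 ON conntab (c2)")

def neighbors(component_id):
    """Return ids of all other components sharing a connection point."""
    c1, c2 = conn.execute(
        "SELECT c1, c2 FROM conntab WHERE id = ?", (component_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT id FROM conntab "
        "WHERE id != ? AND (c1 IN (?, ?) OR c2 IN (?, ?)) ORDER BY id",
        (component_id, c1, c2, c1, c2),
    )
    return [row[0] for row in rows]
```

Traversing the network then amounts to calling such a look-up repeatedly for each newly found component.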

Figure 4.1: An example representation of the relationship between conntab and a small connected circuit.


For the Neo4j graph database, creating the graph of the electrical network was done in a more controlled manner. The algorithm used creates the graph step by step as follows:

1. Map connection point to a set of components containing that connection point.

2. For each connection point set:

• If an OCC exists in the set, connect all TCC’s in the set to the OCC.

• Else connect all components in the set to each other in an arbitrary direction (in this project, they are connected in the direction of the higher id).

This algorithm was created during the project and creates a graph that is similar to the domain model of an electrical circuit. Each component is represented as a node and thus has its neighbors directly linked due to index-free adjacency. Figure 4.2 shows the different types of connection point sets that can occur and how they are connected during the algorithm. Appendix A shows a pseudo-code version of the algorithm.
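The two steps can be sketched in Python as below. This is an illustrative reading of the description, not the pseudo-code from Appendix A: the input format (tuples of id, c1, c2), the choice of the lowest-id OCC when a set contains several, and the TCC-to-OCC edge direction are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def build_edges(components):
    """components: iterable of (id, c1, c2). Returns a set of directed (src, dst) edges."""
    # Step 1: map each connection point to the components containing it.
    point_sets = defaultdict(set)
    is_occ = {}
    for comp_id, c1, c2 in components:
        point_sets[c1].add(comp_id)
        point_sets[c2].add(comp_id)
        is_occ[comp_id] = (c1 == c2)

    edges = set()
    # Step 2: connect the components within each connection point set.
    for members in point_sets.values():
        occs = sorted(m for m in members if is_occ[m])
        tccs = sorted(m for m in members if not is_occ[m])
        if occs:
            # Assumption: use the lowest-id OCC and point edges from TCC to OCC.
            for tcc in tccs:
                edges.add((tcc, occs[0]))
        else:
            # No OCC: connect all components pairwise, toward the higher id.
            for low, high in combinations(tccs, 2):
                edges.add((low, high))
    return edges

# Hypothetical circuit: cable 1 - bus bar 2 - cable 3 - cable 4.
sample = [(1, 10, 20), (2, 20, 20), (3, 20, 30), (4, 30, 40)]
sample_edges = build_edges(sample)
```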

Figure 4.2: A representation of the different connection point sets that can occur and how they are connected.


4.3 Benchmark Framework

The purpose of the benchmark is to measure the execution time for different search queries. This is done by sending queries to each database and then waiting for the results. The measured execution time is presented in three different statistical forms: average time, standard deviation and average throughput. Figure 4.3 shows a flowchart of a benchmark test given a search query.

Figure 4.3: A flowchart representation of a benchmark test given a query.

The framework of the benchmark is built in Java version 8 and uses Oracle Java Database Connectivity version 12.2.0.1 and Neo4j Java Driver version 1.5.1 to connect to the corresponding databases. Using these connection APIs, the benchmark can send queries and receive results that are then verified. The tests that the benchmark performs are comprised of several runs for each query. The gathered results from each test are then stripped of the cold runs (i.e. query execution with empty database cache) to only contain data from the hot runs (i.e. query execution with non-empty cache) for statistical calculations. This process is performed for both Oracle and Neo4j and verification of the result accuracy is done by checking that the resulting set of components from both databases have identical properties.

The measurement of execution time performed by the benchmark is done using the machine’s internal clock. The time is collected before the query is sent and right after the result is received. Using these two time values, an elapsed time is calculated for each run of the query and, at the end, the average time is calculated using all the elapsed time values. Throughput and standard deviation are calculated with conventional statistical formulas [21] from the collected execution time measurements.

As explained in section 2.3, five characteristics make a general benchmark good according to Huppler [11]. These characteristics provide insight into making a good benchmark in the aspects of relevance, repeatability, fairness, verifiability and economy. However, as mentioned before, making a perfect benchmark that takes all of these aspects into account is almost impossible. For that reason, certain non-applicable aspects must be overlooked in order to put more focus on the other aspects. For this work, the economic aspect can be considered fulfilled or irrelevant because there are no economic factors involved.

This project is of a scientific nature; the databases being tested need to be on equal grounds and the results need to be repeatable, verifiable and fair to have scientific value. Hence, the most important factors to consider are fairness, repeatability and verifiability. The aspect of relevance comes from the measurement unit of the benchmark, which is also the focus of this project. Relevance is also a natural consequence of the fairness and verifiability aspects.

For fairness, both databases are run on similar machines and have enough memory to load each dataset into main memory, although the memory needed may grow beyond that during execution.

4.3.1 Query

The query used in the benchmark is based on the Breadth-First Search algorithm (BFS), which given a component and some stop-property traverses through the network. The traversal stops when the stop-property is fulfilled or the entire connected network has been traversed. For this project, the query is a part of the benchmark suite and is used in two ways: for a complete search query (traversing without stop-property) and a stop-property search query.

Implementing the query was done via an SQL-stored procedure for the Oracle RDBMS and through a Cypher-stored procedure for the Neo4j graph database. For Oracle, the stored procedure was implemented by Digpro since there are not any pre-existing procedures that can do these kinds of executions. In contrast, the Neo4j graph database has a Cypher library called APOC that contains pre-defined procedures that can perform graph-related algorithms, including BFS-related algorithms.

The logic behind the stored procedure for Oracle is that through an iterative process we are able to find all components connected to a starting point. Step-by-step, this process first finds the connected components to a specified starting point, then checks if one of those connected components fulfills the stop-property. For the rest of the components that did not fulfill the stop-property, the query finds the connected components for each of them and repeats the process. This is done to all the components until there are no more to be found. All of the components that the stored procedure encounters, including the starting point, are collected and then stored in a list, which at the end of the procedure is returned.
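The iterative process described above can be sketched as a breadth-first trace in Python. This is not the Digpro stored procedure; the neighbors callable, the stops predicate and the toy network are hypothetical, and the sketch assumes that components fulfilling the stop-property are collected but not expanded further.

```python
from collections import deque

def trace(neighbors, start, stops=lambda node: False):
    """Breadth-first trace from start.

    neighbors(node) returns the components connected to node.
    Nodes fulfilling stops(node) are collected but not expanded.
    """
    visited = [start]          # every encountered component, start included
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current):
            if nxt in seen:
                continue
            seen.add(nxt)
            visited.append(nxt)
            if not stops(nxt):
                queue.append(nxt)
    return visited

# Hypothetical toy network: a chain 1 - 2 - 3 - 4.
adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
complete = trace(lambda n: adjacency[n], 1)
stopped = trace(lambda n: adjacency[n], 1, stops=lambda n: n == 3)
```

With no stop-property the whole connected network is returned; with a stop-property on component 3, the trace ends there and component 4 is never reached.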

For the Neo4j graph database, the procedure used was the subgraphNodes procedure provided by the APOC library. The subgraphNodes procedure performs BFS-based operations given some parameters, such as starting node and stop-property, and then returns a set of unique nodes that corresponds to the operation. One inconvenience with the subgraphNodes procedure’s BFS algorithm is that it does not include the nodes that fulfill the stop-property (end nodes). A possible solution is to perform subgraphNodes twice: one normal BFS and one to find the outer boundary nodes, being the end nodes. Appendices B and C show the two versions of the query.

Chapter 5

Results

This chapter presents the results in charts with accompanying explanations. The full benchmark test results can be seen in Appendices D and E.

5.1 Execution Time

The benchmark performed two tests on three different datasets. The first test was the complete search test that traversed the network without any stop-properties. The second test was the stop-property search test. A complete search on each dataset was done to find a whole graph given a node, and given that each dataset has a number of graphs, the results are relative to the size of the dataset and the graphs. Figure 5.1 represents the results of the complete search query and Figure 5.2 represents the results of the stop-property search query.


[Bar chart. Y-axis: Time (ms). X-axis: 8K-set, 220K-set, 6.3M-set. Oracle: 276, 410 and 3,518 ms; Neo4j: 39, 41 and 193 ms.]

Figure 5.1: Execution time result for the complete search query.

The result from the complete search query shows that Neo4j’s execution time is approximately 5-7 times shorter than Oracle’s execution time. As mentioned before, the results are given relative to the graph size. The graph sizes used for the datasets are 7745, 8268 and 35 868 nodes respectively. This means that it takes 39 ms for Neo4j to find a graph with 7745 nodes in a dataset of approximately 8000 nodes.


[Bar chart. Y-axis: Time (ms). Oracle: 123, 253 and 1,193 ms; Neo4j: 33, 55 and 102 ms, in dataset order.]

Figure 5.2: Execution time result for the stop-label search query.

The result from the stop-property search query shows that Neo4j’s execution time is approximately 4-6 times shorter than Oracle’s execution time. Here the graph sizes are 3358, 5008 and 48 061 nodes respectively.


5.2 Throughput

Another way of looking at execution time is to look at the throughput of nodes (i.e. nodes/ms) from the databases.
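As a worked example of the metric, the snippet below computes the throughput for the Neo4j complete search on the 8K-set from the figures stated in section 5.1 (7745 nodes found in 39 ms). The function name is hypothetical.

```python
def throughput(nodes: int, time_ms: float) -> float:
    """Throughput in nodes per millisecond."""
    return nodes / time_ms

# Neo4j, complete search, 8K-set: 7745 nodes in 39 ms.
neo4j_8k = throughput(7745, 39)
```

Rounded to two decimals this gives 198.59 nodes/ms.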

[Bar chart. Y-axis: Throughput (nodes/ms). X-axis: 8K-set, 220K-set, 6.3M-set. Oracle: 28.04, 20.15 and 10.19; Neo4j: 198.59, 201.66 and 186.03.]

Figure 5.3: Throughput result for the complete search query.

[Bar chart. Y-axis: Throughput (nodes/ms). X-axis: 8K-set, 220K-set, 6.3M-set. Oracle: 27.31, 19.81 and 7.27; Neo4j: 102.79, 91.45 and 85.46.]

Figure 5.4: Throughput result for the stop-label search query.

Figures 5.3 and 5.4 show that the throughput from Neo4j is relatively constant with some deviation of about 20 nodes/ms, but for Oracle it becomes apparent that the larger the graph size is the lower the throughput. There is also some difference considering the magnitude of the throughput, in that Neo4j has a throughput that is 5-10 times higher than that of Oracle.


5.3 Standard Deviation

Looking further into the results and the differences in execution time between each iteration, the standard deviation of each test on each dataset is presented in Figures 5.5 and 5.6.


[Bar chart. Y-axis: Standard Deviation (ms). Oracle: 4, 18 and 722 ms; Neo4j: 6, 5 and 39 ms, in dataset order.]

Figure 5.5: Standard deviation of the complete search results.


[Bar chart. Y-axis: Standard Deviation (ms). Oracle: 4, 12 and 330 ms; Neo4j: 4, 11 and 28 ms, in dataset order.]

Figure 5.6: Standard deviation of the stop-label search results.

The results show that in smaller datasets there is not much of a difference in standard deviation between Oracle and Neo4j. However, when using a larger dataset such as the 6.3M-set, the difference becomes more significant.

Chapter 6

Discussion

This chapter discusses the results presented in the previous chapter and reflects on whether the objective of this project has been fulfilled.

6.1 Benchmark Comparison

Looking at Figures 5.1 and 5.2, we can deduce that Neo4j’s execution time is shorter than Oracle’s execution time for all tests. This supports findings from previous research and the claim that index-free adjacency is faster than table look-up when it comes to traversing an interconnected network.

When looking at the standard deviation of each test, shown in Figures 5.5 and 5.6, we can deduce that the variation in execution time depends on the graph size: when the graph size becomes larger, the standard deviation increases. In the case of the 6.3M-set, the reason for the significant difference in standard deviation could be due to the size of the graph or the SQL procedure. The procedure in the Oracle database is developed by Digpro and has more functionality than the Neo4j counterpart. This could have caused the increase in computation time when large amounts of data are present and thereby caused the deviation to increase.

Figures 5.3 and 5.4 show the throughput results of the three datasets for the complete search query and the stop-label search query respectively. The throughput results for Neo4j are approximately equal across the datasets, while for Oracle the throughput decreases as the graph size increases. A possible reason is that the time complexity of Oracle’s procedure includes a logarithmic factor, which is not included in the time complexity of Neo4j’s procedure. This is further discussed in section 6.2.

The throughput results vary significantly between Neo4j and Oracle: in Tables D.1 and E.1, Neo4j’s throughput is roughly 4 to 18 times higher. This is probably due to the index-free adjacency property of Neo4j, which makes graph traversal efficient.

6.2 Complexity Analysis

Performance comparison of the Neo4j graph database and the Oracle relational database can also be done through time complexity analysis and execution plan analysis. Looking into the queries used, the benefit of index-free adjacency can be seen in the time complexity: the Neo4j stored procedure runs in O(k), where k is the number of nodes in the result. For each node, the procedure only needs to look at the nodes connected to it, which costs O(1) thanks to index-free adjacency, and the only nodes visited are those in the final result. For the Oracle stored procedure, the time complexity is O(2k log(n)), where k is the number of nodes in the result and n is the number of nodes in the graph. For each node, the procedure performs one table look-up for each end of the node, which costs O(2 log(n)). This is done for all k nodes in the result, giving O(2k log(n)), or O(k log(n)) once the constant factor is dropped. These complexities support the presented results and show the benefit of index-free adjacency for connected data.
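The difference between the two cost models can be illustrated with a small Python sketch. It is not the Neo4j or Oracle procedure used in this project: the toy network, the function names and the use of a sorted list with binary search as a stand-in for a database index are all assumptions made for illustration. A Python dictionary also only approximates index-free adjacency, since a dictionary access is a hash look-up rather than a pointer dereference.

```python
from bisect import bisect_left
from collections import defaultdict, deque

# Hypothetical toy network: component id -> its two connection points.
COMPONENTS = {1: (10, 11), 2: (11, 12), 3: (12, 13), 4: (20, 21)}


def build_adjacency(components):
    """Precompute direct neighbor sets from shared connection points."""
    by_point = defaultdict(list)
    for cid, ends in components.items():
        for cp in ends:
            by_point[cp].append(cid)
    adj = {cid: set() for cid in components}
    for members in by_point.values():
        for a in members:
            for b in members:
                if a != b:
                    adj[a].add(b)
    return adj


def bfs_adjacency(adj, start):
    """BFS that follows stored neighbor references; no index is consulted."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bfs_index_lookup(components, start):
    """BFS that finds neighbors with one binary search per end of each node."""
    # Sorted (connection_point, component_id) pairs play the role of the index.
    index = sorted((cp, cid) for cid, ends in components.items() for cp in ends)
    seen, queue, lookups = {start}, deque([start]), 0
    while queue:
        node = queue.popleft()
        for cp in components[node]:
            lookups += 1                   # one O(log n) search per end
            i = bisect_left(index, (cp,))  # first entry for this connection point
            while i < len(index) and index[i][0] == cp:
                nxt = index[i][1]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                i += 1
    return seen, lookups


if __name__ == "__main__":
    print(bfs_adjacency(build_adjacency(COMPONENTS), 1))
    print(bfs_index_lookup(COMPONENTS, 1))
```

On this toy network both traversals return components 1, 2 and 3, and the index-based variant performs two look-ups per visited component, which corresponds to the factor 2k in the complexity stated above.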

6.3 Execution Plan Analysis

The execution time is affected by the query processing described in sections 2.1.4 and 2.2.3. As mentioned before, the query processing in Neo4j and Oracle has similarities; both follow a general pattern of parsing, optimization and execution. However, in detail, their individual steps use different techniques and are furthermore split up into smaller and more precise steps. Although the difference in techniques can influence the execution time, measuring the accumulated differences in the final step of the execution plan can provide insight into what makes one of them faster than the other. In this project, we have two queries, one for stop-label search and one for complete search. Both queries use the same logic; the only difference is that they stop at different times, meaning it is enough to check one of them and highlight the differences.

The execution plan for Neo4j’s stop-label search query is shown in Figure 6.1.

Figure 6.1: The execution plan for the Cypher stop-label query in Neo4j.

As we can see in the figure, the execution plan performs the search twice, and the query starts with two index scans over the nodes to find the starting node. The first index scan finds the start node for the normal BFS and the second finds the start node for the endnodes search, see section 4.3.1. The results of the index scans are passed to the procedures, which perform their individual traversals. At the end of the procedures, the produced result sets are projected (i.e. the relevant information is extracted), then unified and passed through a "distinct" filter that removes duplicate values, before being returned as a single result set. The execution plan shows that all data collection is done by the procedure and nowhere else, which is to be expected.

The execution plan for the Oracle procedure, shown in Figure 6.2, is somewhat difficult to interpret, since all of the computation is done inside a procedure and only appears as "COLLECTION ITERATOR PICKLER FETCH". However, since the procedure performs a BFS and spends most of its time searching for other nodes with shared connection points, an analysis of that search query gives a good picture of what the procedure does.

Figure 6.2: The execution plan for the SQL procedure in Oracle.

Figure 6.3 shows the execution plan for the search query that the Oracle procedure performs to find the next nodes. The figure shows that the Oracle query engine performs two index scans to find the nodes with the same connection point values and then returns those rows. This is performed for every node that the procedure encounters.

Figure 6.3: The execution plan for the search query that the SQL procedure performs.


Looking at the execution plans for Oracle and Neo4j, there is not much that highlights the differences. Since Oracle hides many of the steps behind the "COLLECTION ITERATOR PICKLER FETCH" step, its entire execution plan is shortened to two steps: the gathering of the result by the procedure in the "COLLECTION ITERATOR PICKLER FETCH" step and the return of the result in the "SELECT STATEMENT" step. In the Neo4j execution plan it is easier to see all the steps, and by cutting the plan into two parts it can be viewed as similar to the Oracle plan: the steps from the two "NodeUniqueIndexSeek" steps to the "Distinct" step correspond to the procedure step, and the steps from "ProduceResults" to "Result" correspond to Oracle’s "SELECT STATEMENT". With the procedure steps isolated in both execution plans, they can be approximately compared to each other. As mentioned earlier, Neo4j traverses a graph using index-free adjacency, while Oracle traverses a graph by iteratively searching for the next nodes. These are the major factors of the execution time, and since index-free adjacency performs a look-up faster than a normal query look-up, this highlights the benefit of Neo4j compared to Oracle.

For the entire Neo4j procedure step, the database only needs to perform two searches for the starting nodes, in the two "NodeUniqueIndexSeek" steps; thereafter Neo4j only needs to traverse the graph by chasing the pointers to the next node. Oracle, on the other hand, needs to perform one query search for the starting node and then, for each node found, continue with new searches for the following nodes in the graph. This repeated use of query searches causes Oracle’s execution time to increase faster than Neo4j’s, hence the large differences in execution time between the databases.

6.4 Practical Applications

The ways in which both procedures perform a BFS are similar. However, due to the fundamental differences between the databases, the Neo4j graph database has an advantage over the Oracle relational database through its index-free adjacency.

Looking into the applications of these kinds of queries, the one that is most useful in practice is the stop-label query. In GIS and NIS, the use of graph traversal is usually limited to local searches, which are a form of traversal until some property is met. However, migrating an entire relational database to a graph database could raise challenging modeling issues. Furthermore, the number of tables and procedures that would need to be migrated has to be considered. Therefore, an integration approach may be preferable.

Chapter 7

Conclusion

This chapter presents the highlights from the results and discussion. Suggestions for future work are also included.

7.1 Summary

Both native graph databases and relational databases can perform graph traversal. However, native graph databases have an advantage with their index-free adjacency.

When analyzing the time complexity and the execution plans of the procedures used in Neo4j and Oracle respectively, it can be seen that both support the results. Furthermore, the statistical analysis confirms that the native graph database performs more consistently and efficiently than the relational database.

Although both database systems can perform graph traversal, it can be concluded that native graph databases perform better regarding execution time of search queries. The Neo4j graph database is significantly more efficient in retrieving sought-after information than the Oracle relational database, and is the recommended database system for applications handling a large amount of highly interconnected data where consistency and efficiency are concerned.


7.2 Future Work

This project leaves room for future work to investigate and identify possibilities regarding the modeling of electrical networks. The model created for this project was sufficient given the scope and time constraints, but a more refined model could improve the results even further.

Other potential future work is to conduct other types of comparisons. Such comparisons could cover graph databases against database types other than relational, reachability against retrieval queries, or memory consumption.

Bibliography

[1] R. Angles Rojas, P. Barcelo, and G. Rios. “A Practical Query Language for Graph DBs”. In: Alberto Mendelzon International Workshop on Foundations of Data Management (2013).

[2] Renzo Angles et al. “Benchmarking database systems for social network applications”. In: First International Workshop on graph data management experiences and systems. GRADES ’13. ACM, June 2013, pp. 1–7. ISBN: 9781450321884.

[3] David A. Bader et al. HPC. URL: http://www.graphanalysis.org/benchmark/ (visited on 03/16/2018).

[4] Shalini Batra and Charu Tyagi. “Comparative Analysis of Relational And Graph Databases”. In: International Journal of Soft Computing and Engineering 2.2 (May 2012). ISSN: 2231-2307.

[5] Alan Beaulieu. Learning SQL. 1. ed. Sebastopol, Calif.: O’Reilly, 2005. ISBN: 0-596-00727-2.

[6] Michael Carey, David Dewitt, and Jeffrey Naughton. “The OO7 Benchmark”. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data. SIGMOD ’93. ACM, June 1993, pp. 12–21. ISBN: 0897915925.

[7] Transaction Processing Performance Council. TPC. URL: http://www.TPC.org/default.asp (visited on 03/16/2018).

[8] Ramez Elmasri. Fundamentals of database systems. 6. ed. 2006. ISBN: 0-136-08620-9.

[9] Guy Harrison. Next generation databases: NoSQL, NewSQL, and Big Data. Expert’s voice in Oracle. 2015. ISBN: 1-4842-1329-7.


[10] Florian Holzschuher and René Peinl. “Querying a graph database – language selection and performance considerations”. In: Journal of Computer and System Sciences 82.1 (Feb. 2016), pp. 45–68. ISSN: 0022-0000.

[11] Karl Huppler. “The Art of Building a Good Benchmark”. In: Performance Evaluation and Benchmarking. Ed. by Raghunath Nambiar and Meikel Poess. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 18–30. ISBN: 978-3-642-10424-4.

[12] IBM. No more joins: An overview of Graph database query languages. URL: https://developer.ibm.com/dwblog/2017/overview-graph-database-query-languages/ (visited on 01/12/2017).

[13] Thomas Kyte. Expert Oracle Database Architecture. Third edition. Expert’s voice in Oracle. 2014. ISBN: 1-4302-6299-0.

[14] Surajit Medhi and Hemanta K. Baruah. “Relational database and graph database: a comparative analysis”. In: Journal of Process Management. New Technologies 5.2 (Apr. 2017), pp. 1–9. ISSN: 2334735X.

[15] M Miler, D Medak, and D Odobasic. “The shortest path algorithm performance comparison in graph and relational database on a transportation network”. In: Promet-Traffic & Transportation 26.1 (2014), pp. 75–82. ISSN: 0353-5320.

[16] Guido Moerkotte and Thomas Neumann. “Dynamic Programming Strikes Back”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. SIGMOD ’08. Vancouver, Canada: ACM, 2008, pp. 539–552. ISBN: 978-1-60558-102-6. DOI: 10.1145/1376616.1376672. URL: http://doi.acm.org/10.1145/1376616.1376672.

[17] Oracle. Oracle Online Documentation. URL: https://docs.oracle.com/database/ (visited on 02/28/2018).

[18] Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases. O’Reilly Media, Inc., 2015. ISBN: 9781491932001.

[19] Johan Svensson. Guest View: Relational vs. graph databases: Which to use and when? URL: https://sdtimes.com/databases/guest-view-relational-vs-graph-databases-use/ (visited on 01/11/2018).


[20] Neo Technologies. The Internet-Scale Graph Platform. 2007. URL: www.neo4j.com (visited on 01/11/2018).

[21] Massachusetts Institute of Technology. Basic statistics. URL: http://www.mit.edu/~6.s085/notes/lecture1.pdf (visited on 06/03/2018).

[22] Jim Webber. Neo4j Graph Database Chief Scientist. Personal Conversation/Interview. Neo4j GraphTour Stockholm, 8 March 2018.


Appendix A

Neo4j Graph Creation Algorithm

Algorithm 1: Neo4j Graph Creation Algorithm
Result: Create a graph of an electrical network
Input: data-set - Data set with components
       db - Database connection
begin
    c-map ← mapConnectionPoint(data-set)
    foreach c-set in c-map do
        if c-set.contains(OCC-type) then
            occ, tcc-set ← splitOCCTCC(c-set)
            foreach tcc in tcc-set do
                db.createConnection(tcc, occ)
            end
        else
            foreach tcc in c-set do
                foreach tcc2 in c-set do
                    if tcc != tcc2 then
                        db.createConnection(tcc, tcc2)
                    end
                end
            end
        end
    end
end
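Algorithm 1 can also be sketched in Python. This is a minimal in-memory illustration and not the implementation used in the project: the tuple layout of a component, the "OCC" and "TCC" type strings, the assumption of at most one OCC component per connection point, and the returned edge list (in place of calls to db.createConnection) are all assumptions made for the example.

```python
from collections import defaultdict


def create_graph(components):
    """Build a list of directed connections from an iterable of components.

    Each component is a hypothetical (component_id, kind, connection_points)
    tuple, where kind is "OCC" or "TCC".
    """
    # Counterpart of mapConnectionPoint: group components by connection point.
    c_map = defaultdict(list)
    for cid, kind, points in components:
        for cp in points:
            c_map[cp].append((cid, kind))

    edges = []  # stands in for db.createConnection(...)
    for c_set in c_map.values():
        occs = [cid for cid, kind in c_set if kind == "OCC"]
        tccs = [cid for cid, kind in c_set if kind == "TCC"]
        if occs:
            occ = occs[0]  # assumption: a single OCC per connection point
            for tcc in tccs:
                edges.append((tcc, occ))
        else:
            # No OCC present: connect every pair, in both directions.
            for tcc in tccs:
                for tcc2 in tccs:
                    if tcc != tcc2:
                        edges.append((tcc, tcc2))
    return edges


if __name__ == "__main__":
    demo = [
        ("a", "TCC", (1, 2)),
        ("b", "TCC", (2, 3)),
        ("c", "TCC", (3, 4)),
        ("j", "OCC", (3,)),
    ]
    print(sorted(create_graph(demo)))
```

As in the pseudocode, the pairwise branch emits each connection twice, once per direction, because the two nested loops run over the same set.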


Appendix B

Cypher BFS Complete Search

Algorithm 2: BFS query in Cypher
Result: All nodes reachable from the start node
Input: relationship-label - label that connects nodes
begin
    MATCH (n:Component)
    WHERE n.DP_OID = "nodeID" AND n.DP_OTYPE = "nodeType"
    CALL apoc.path.subgraphNodes(n, {bfs: true, relationshipFilter: 'relationship-label'})
    YIELD node
    RETURN node
end


Appendix C

Cypher BFS Stop-Label Search

Algorithm 3: BFS query in Cypher
Result: All nodes reachable from the start node up to and including the stop-label nodes
Input: stoplabels - labels the search will stop at
       relationship-label - label that connects nodes
begin
    MATCH (n:Component)
    WHERE n.DP_OID = "nodeID" AND n.DP_OTYPE = "nodeType"
    CALL apoc.path.subgraphNodes(n, {bfs: true, labelFilter: '-stoplabels', relationshipFilter: 'relationship-label'})
    YIELD node
    RETURN node
    UNION
    MATCH (n:Component)
    WHERE n.DP_OID = "nodeID" AND n.DP_OTYPE = "nodeType"
    CALL apoc.path.subgraphNodes(n, {bfs: true, labelFilter: '/stoplabels', relationshipFilter: 'relationship-label'})
    YIELD node
    RETURN node
end
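Read together, the two branches of the query return the nodes that can be reached without passing a stop-labeled node, plus the stop-labeled nodes at which the search ends. A minimal Python sketch of that combined behavior follows. It assumes this reading of the '-' and '/' label filters and that the start node is always expanded; the adjacency dictionary, the label names "Cable" and "Switch" and the function name are hypothetical and not part of the project.

```python
from collections import deque


def stop_label_search(adj, labels, start, stop_labels):
    """BFS that keeps stop-labeled nodes in the result but does not expand them.

    adj:         node -> iterable of neighboring nodes
    labels:      node -> set of labels on that node
    stop_labels: set of labels at which the traversal ends
    """
    seen = {start}
    queue = deque([start])  # assumption: the start node is expanded regardless of its labels
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt in seen:
                continue
            seen.add(nxt)
            if labels[nxt] & stop_labels:
                continue  # boundary node: returned, but not traversed further
            queue.append(nxt)
    return seen


if __name__ == "__main__":
    # Hypothetical chain a - b - s - c, where s carries the stop label.
    adj = {"a": ["b"], "b": ["a", "s"], "s": ["b", "c"], "c": ["s"]}
    labels = {"a": {"Cable"}, "b": {"Cable"}, "s": {"Switch"}, "c": {"Cable"}}
    print(sorted(stop_label_search(adj, labels, "a", {"Switch"})))
```

In the example chain the search returns a, b and s; the node c behind the stop-labeled node is never visited.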


Appendix D

Result Data - Complete Search

Test: Complete Search

Data Set:                 8K-set            220K-set          6.3M-set
Graph Size (nodes):       7745              8268              35868
Database:                 Neo4j    Oracle   Neo4j    Oracle   Neo4j    Oracle
Iterations (ms):          58       281      39       398      189      3167
                          47       271      41       395      157      3077
                          42       274      46       408      166      3091
                          40       280      43       409      173      2908
                          37       280      57       417      247      3504
                          44       269      43       426      165      3551
                          34       272      41       391      191      5413
                          32       274      43       401      168      3446
                          33       272      39       398      185      3979
                          36       281      36       400      185      2906
                          37       272      41       394      157      2895
                          40       280      36       394      209      2753
                          50       278      41       448      167      3030
                          34       278      39       442      180      4072
                          40       280      37       396      153      2902
                          36       270      40       412      169      3470
                          37       278      35       395      185      4115
                          33       281      48       452      254      4725
                          35       275      36       411      203      3365
                          35       280      42       424      236      2885
                          39       274      38       406      310      4634
Average (ms):             39       276.19   41       410.33   192.81   3518.48
Standard Deviation (ms):  6.39     4.11     4.96     18.36    39.3     722.48
Throughput (nodes/ms):    198.59   28.04    201.66   20.15    186.03   10.19

Table D.1: Benchmark results of the complete search, along with average and standard deviation calculations.
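The summary rows of Table D.1 can be recomputed from the iteration values. The following minimal Python sketch does so for the Neo4j column of the 8K-set. It assumes that the reported standard deviation is the sample standard deviation (n - 1 in the denominator), which reproduces the reported value, and that throughput is the graph size divided by the average execution time. The function and variable names are illustrative only.

```python
from statistics import mean, stdev

# Neo4j execution times (ms) for the complete search on the 8K-set, from Table D.1.
NEO4J_8K_MS = [58, 47, 42, 40, 37, 44, 34, 32, 33, 36, 37,
               40, 50, 34, 40, 36, 37, 33, 35, 35, 39]
GRAPH_SIZE_NODES = 7745  # "Graph Size (nodes)" for the 8K-set


def summarize(times_ms, graph_size):
    """Return (average ms, sample standard deviation ms, throughput nodes/ms)."""
    avg = mean(times_ms)
    sd = stdev(times_ms)          # sample standard deviation (n - 1)
    throughput = graph_size / avg  # nodes returned per millisecond
    return round(avg, 2), round(sd, 2), round(throughput, 2)


if __name__ == "__main__":
    print(summarize(NEO4J_8K_MS, GRAPH_SIZE_NODES))  # prints (39, 6.39, 198.59)
```

The same function applied to the other columns of Tables D.1 and E.1 gives the remaining summary rows under the same assumptions.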


Appendix E

Result Data - Stop-Label Search

Test: Stop-Label Search

Data Set:                 8K-set            220K-set          6.3M-set
Graph Size (nodes):       3358              5008              8676
Database:                 Neo4j    Oracle   Neo4j    Oracle   Neo4j    Oracle
Iterations (ms):          33       119      50       244      96       978
                          35       121      89       247      85       1047
                          33       122      51       267      99       731
                          32       124      48       247      198      1352
                          32       120      61       256      99       2020
                          36       132      52       293      85       1180
                          35       124      58       244      89       961
                          31       123      46       246      102      1547
                          31       129      47       242      93       1415
                          31       119      58       250      86       1048
                          31       123      51       249      85       1083
                          31       125      82       258      88       1424
                          30       123      64       244      90       811
                          31       123      50       260      102      703
                          32       130      45       247      84       1485
                          30       118      47       246      162      1453
                          29       124      52       264      101      1426
                          30       125      46       248      107      1021
                          48       121      47       249      103      1480
                          33       118      55       260      84       808
                          32       119      51       249      94       1077
Average (ms):             32.67    122.95   54.76    252.86   101.52   1192.86
Standard Deviation (ms):  3.94     3.83     11.48    11.62    27.67    329.55
Throughput (nodes/ms):    102.79   27.31    91.45    19.81    85.46    7.27

Table E.1: Benchmark results of the stop-label search, along with average and standard deviation calculations.


TRITA-EECS-EX-2019:71

www.kth.se
