Learning Neo4j 3.x - Second Edition

By: Jerome Baton

Overview of this book

Neo4j is a graph database that allows traversing huge amounts of data with ease. This book aims at quickly getting you started with the popular graph database Neo4j. Starting with a brief introduction to graph theory, this book will show you the advantages of using graph databases along with data modeling techniques for graph databases. You'll gain practical hands-on experience with commonly used and lesser-known features for updating the graph store with Neo4j's Cypher query language. Furthermore, you'll also learn to create awesome procedures using APOC and extend Neo4j's functionality, enabling integration, algorithmic analysis, and other advanced spatial operation capabilities on data. Throughout the book, you will come across implementation examples of the latest updates in Neo4j, such as in-graph indexes, scaling, performance improvements, visualization, data refactoring techniques, security enhancements, and much more. By the end of the book, you'll have gained the skills to design and implement modern graph applications, from graphing data to unraveling business capabilities with the help of real-world use cases.

Why use graph databases, or not


By now, you should have a good understanding of what graph databases are and how they relate to other database management systems and models. Much of the remainder of this book will be drilling into a bit more detail on the specifics of Neo4j as an example implementation of such a database management system. Before that, however, it makes sense to explore why these kinds of databases are of such interest to modern-day software professionals--developers, architects, project and product managers, and IT directors alike.

The fact of the matter is that a number of typical data problems and database queries are an excellent match for a graph database, while a number of other types of data questions are not what such systems are designed to answer. Let's explore these for a bit and identify the characteristics of your dataset and your query patterns that will determine whether or not a graph database is going to be a good fit.

Why use a graph database?

When you are trying to solve a problem that meets any of the following descriptions, you should probably consider using a graph database such as Neo4j.

Complex queries

Complex queries are the types of questions that you want to ask of your data that are inherently composed of a number of complex join-style operations. These operations, as every database administrator knows, are very expensive in relational database systems, because they require computing the Cartesian product of the indices of the tables being joined. That may be okay for one or two joins between two or three tables in a relational database management system, but as you can easily understand, the problem grows exponentially with every table join that you add. On larger datasets, it can become an intractable problem in a relational system, and this is why complex queries become problematic.

An example of such a complex query would be finding all the restaurants in a certain London neighborhood that serve Indian food, are open on Sundays, and cater for kids. In relational terms, this would mean joining up data from the restaurant table, the food type table, the opening hours table, the caters for table, and the zip-code table holding the London neighborhoods, and then providing an answer. No doubt there are numerous other examples where you would need to do these types of complex queries; this is just a hypothetical one.

In a graph database, a join operation will never need to be performed: all we need to do is to find a starting node in the database (for example, London), usually with an index lookup, and then just use the index-free adjacency characteristic and hop from one node (London) to the next (Restaurant) over its connecting relationships ((Restaurant)-[:LOCATED_IN]->(London)). Every hop along this path is, in effect, the equivalent of a join operation. Relationships between nodes can therefore also be thought of as an explicitly stored representation of such a join operation.
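To make this concrete, the restaurant question could be expressed as a single Cypher pattern along the following lines. This is only a sketch: the labels, relationship types, and property names (:Restaurant, :Neighborhood, :Cuisine, and so on) are hypothetical and depend entirely on your own model; every hop in the pattern stands in for what would have been a join in a relational system.

```
// Hypothetical model: restaurants located in a neighborhood, serving a cuisine,
// open on given days, and catering for a given audience.
MATCH (n:Neighborhood {city: 'London', name: 'Soho'})<-[:LOCATED_IN]-(r:Restaurant)
MATCH (r)-[:SERVES]->(:Cuisine {name: 'Indian'})
MATCH (r)-[:OPEN_ON]->(:Day {name: 'Sunday'})
MATCH (r)-[:CATERS_FOR]->(:Audience {name: 'Kids'})
RETURN r.name
```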

We often refer to these types of queries as pattern matching queries. We specify a pattern (refer to the following diagram: blue connects to orange, orange connects to green, and blue connects to green), we anchor that pattern to one or more starting points and we start looking for the matching occurrences of that pattern. As you can see, the graph database will be an excellent tool to spin around the anchor node and figure out whether there are matching patterns connected to it. Non-matching patterns will be ignored, and matching patterns that are not connected to the starting node will not even be considered.

This is actually one of the key performance characteristics of a graph database: as soon as you grab a starting node, the database will only explore the vicinity of that starting node and will be completely oblivious to anything that is not connected to it. The key performance characteristic that follows from this is that query performance is largely independent of the dataset size, because in most graphs, not everything is connected to everything else. By the same token, as we will see later, performance will be much more dependent on the size of the result set, and this will also be one of the key things to keep in mind when putting together your persistence architecture:

Matching patterns connected to an anchor node
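As a sketch, the triangle pattern in the preceding diagram could be anchored and matched with a single Cypher statement; the :Blue, :Orange, and :Green labels and the CONNECTS_TO relationship type are placeholders, not part of any real model.

```
// Anchor on one blue node, then look for the triangle pattern around it.
MATCH (b:Blue {id: 42})-[:CONNECTS_TO]->(o:Orange),
      (o)-[:CONNECTS_TO]->(g:Green),
      (b)-[:CONNECTS_TO]->(g)
RETURN b, o, g
```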

In-the-clickstream queries on live data

We all know that you can implement different database queries--such as the preceding example--in different kinds of database management systems. However, in most alternative systems, these types of queries would yield terrible performance on the live database management systems and potentially endanger the responsiveness of an entire application. The reaction of the relational database management industry, therefore, has been to make sure that these kinds of queries will be done on precalculated, preformatted data that will be specifically structured for this purpose.

This means duplicating data, denormalizing data, and using techniques such as Extract, Transform, and Load (ETL) that are often used in business intelligence systems to create query-specific representations (sometimes also referred to as cubes) of the data at hand. Obviously, these are valuable techniques--the business intelligence industry would not be the billion-dollar industry that it is otherwise--but they are best suited to data that is allowed to be somewhat stale, less than up to date. Graph databases will allow you to answer a wider variety of these complex queries, between a web request and web response, on data that does not have to be replicated as much and can therefore be updated in near real time.

Pathfinding queries

Another type of query that is extremely well-suited for graph databases is a query where you will be looking to find out how different data elements are related to each other. In other words, finding the paths between different nodes on your graph. The problem with such queries in other database management systems is that you will actually have to understand the structure of the potential paths extremely well. You will have to be able to tell the database how to jump from table to table, so to speak. In a graph database, you can still do that, but typically you won't. You just tell the database to apply a graph algorithm to a starting point and an endpoint and be done with it. It's up to the database to figure out if and how these data elements are connected to each other and return the result as a path expression for you to use in your system. The fact that you are able to delegate this to the database is extremely useful, and often leads to unexpected and valuable insights.
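As an illustration, a minimal pathfinding query in Cypher could look like the following sketch. The :Person label and the name property are assumptions about the model; shortestPath is Cypher's built-in function for exactly this kind of question.

```
// Ask the database how Alice and Bob are connected, up to 10 hops apart,
// without describing the route ourselves.
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}),
      p = shortestPath((a)-[*..10]-(b))
RETURN p, length(p)
```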

Obviously, the query categories mentioned are just that: categories. You would have to apply them to any of the fields of research that we discussed earlier in this chapter to really reap the benefits. We will come back to this in later chapters.

When not to use a graph database and what to use instead

As we discussed earlier in this chapter, the whole concept of the category of Not Only SQL databases is all about task orientation: use the right tool for the job. That must also mean that there are certain use cases that graph databases are not perfectly suited for. As a fan of graph databases at heart, I don't find this easy to admit, but it would be foolish, dishonest, and frankly not credible to claim that graph databases are the best choice for every use case. So, let's briefly touch on a couple of categories of operations that you would probably want to separate from the graph database category that Neo4j belongs to.

The following operations are where I would personally not recommend using a graph database like Neo4j, or at least not in isolation.

Large set-oriented queries

If you think back to what we discussed earlier and consider how graph databases achieve the performance that they do on complex queries, it immediately follows that there are a number of cases where graph databases will still work, but not as efficiently. If you are trying to put together large lists of things, effectively sets, that do not require a lot of joining but do require a lot of aggregation (summing, counting, averaging, and so on) on these sets, then the performance of the graph database compared to other database management systems will not be as favorable. It is clear that a graph database will be able to perform these operations, but the performance advantage will be smaller, or perhaps even negative. Set-oriented databases such as relational database management systems will most likely deliver just as much, or even more, performance.
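For example, a query along the lines of the following sketch, which scans every node with a (hypothetical) :Order label and aggregates over the whole set, traverses no relationships at all; a relational or columnar store will typically handle this kind of set-oriented workload at least as well.

```
// Purely set-oriented: scan all :Order nodes and aggregate, no traversal involved.
MATCH (o:Order)
RETURN o.country AS country, count(o) AS orders, sum(o.total) AS revenue
ORDER BY revenue DESC
```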

Graph global operations

As we discussed earlier, graph theory has done a lot of fascinating work on the analysis and understanding of graphs in their entirety. Finding clusters of nodes, discovering unknown patterns of relationships between nodes, and computing the centrality and/or betweenness of specific graph components are extremely interesting and wonderful concepts, but they are very different from the ones that graph databases excel at. These concepts look at the graph in its entirety, and we refer to them as graph global operations. While graph databases are extremely powerful at answering graph local questions, there is an entire category of graph tools (often referred to as graph processing engines or graph compute engines) that address graph global problems.

Many of these tools serve an extremely specific purpose and even use specific hardware and software (usually using lots of memory and CPU horsepower) to achieve their tasks, and they are typically part of a very different side of the IT architecture. Graph processing is typically done in batches, in the background, over the course of several hours/days/weeks and would not typically be well placed between a web request and web response. It's a very different kind of ball game.
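That said, some graph global computations can be brought closer to the database. Around the Neo4j 3.x timeframe, a separate Graph Algorithms plugin (the algo.* procedures, a cousin of APOC) exposed algorithms such as PageRank; the call below is a sketch that assumes the plugin is installed and that your graph happens to contain :Page nodes connected by LINKS relationships.

```
// Graph global computation: rank all :Page nodes by PageRank over LINKS relationships.
CALL algo.pageRank.stream('Page', 'LINKS', {iterations: 20, dampingFactor: 0.85})
YIELD nodeId, score
MATCH (p) WHERE id(p) = nodeId
RETURN p.name AS page, score
ORDER BY score DESC
LIMIT 10
```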

Simple aggregate-oriented queries

We mentioned that graphs and graph database management systems are great for complex queries--things that would make your relational system choke. As a consequence, simple queries, where write patterns and read patterns align to the aggregates that we are trying to store, are typically served quite inefficiently in a graph and would be more efficiently handled by an aggregate-oriented Key-value or Document store. If complexity is low, the advantage of using a graph database system will be lower too.
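As a small sketch, a lookup like the following, where a whole aggregate (here a hypothetical user profile stored as node properties) is read by a single key, gains little from living in a graph; a key-value or document store serves this access pattern just as naturally.

```
// A simple key-based read: fetch one user profile by its identifier.
// No relationships are involved, so the graph model adds little here.
MATCH (u:User {userId: 'u-12345'})
RETURN u.name, u.email, u.preferences
```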

Hopefully, this gives you a better view of the things that graph databases are good at and not so good at.