Neo4j High Performance

By: Sonal Raj

The Neo4j graph database


Neo4j is one of the most popular graph databases today. It was developed by Neo Technology, Inc., which operates from the San Francisco Bay Area in the U.S. It is written in Java and is available as open source software. Neo4j is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. Graph databases generally use one of two storage approaches:

  • Most graph databases store data relationally under the hood, but abstract it behind an interface that presents operations, queries, and interaction with the data in a simpler, more graph-like manner.

  • Some graph databases, such as Neo4j, are native graph database systems. This means that they store data inherently as nodes and relationships. They are faster and are optimized for more complex data.

In the following sections, we will see an overview of the Neo4j fundamentals, basic CRUD operations, along with the installation and configuration of Neo4j in different environments.

ACID compliance

Contrary to popular belief, ACID does not contradict or negate the concept of NoSQL. NoSQL fundamentally provides a direct alternative to the explicit schema in classical RDBMSes. It allows the developer to treat things asymmetrically, whereas traditional engines have enforced rigid sameness across the data model. This is interesting because it provides a different way to deal with change and, for larger datasets, opens up opportunities to deal with volume and performance. In other words, the transition is about shifting the handling of complexity from the database administrators to the database itself.

Transaction management has been a talking point of NoSQL technologies since they started to gain popularity. Trading off transactional attributes for performance and scalability has been a common theme in nonrelational technologies that target big data. Some databases (for example, BigTable, Cassandra, and CouchDB) opted to trade off consistency, which allows clients in a distributed system to read stale data in some cases (eventual consistency). Key-value stores that concentrate on read performance (for example, Memcached) gave up durability, which was not of much interest to them. Others, typically document-oriented databases, offer atomicity only at the level of a single operation, without the possibility of wrapping multiple database operations in a single transaction. Although devised a long time ago for relational databases, transaction attributes are still important in most practical use cases. Neo4j has taken a different approach here. Neo4j's goal is to be a graph database, with the emphasis on database. This means that you'll get full ACID support from the Neo4j database:

  • Atomicity (A): You can wrap multiple database operations within a single transaction and be sure that they are all executed atomically; if one of the operations fails, the entire transaction is rolled back.

  • Consistency (C): When you write data to the Neo4j database, you can be sure that every client accessing the database afterwards will read the latest updated data.

  • Isolation (I): Operations within a single transaction are isolated from those in other transactions, so that writes in one transaction won't affect reads in another transaction.

  • Durability (D): You can be certain that the data you write to Neo4j will be written to disk and be available after a database restart or a server crash. If the system fails (hardware or software), the database will pick itself back up.
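
The all-or-nothing behavior described under atomicity can be illustrated with a small sketch. This is not Neo4j's implementation or API; ToyStore, run_atomically, add_gates, and failing_write are hypothetical names invented for the illustration, and the store is a plain in-memory dictionary that is restored from a snapshot when any operation fails.

```python
import copy


class ToyStore:
    """Hypothetical in-memory store with all-or-nothing transactions."""

    def __init__(self):
        self.data = {}

    def run_atomically(self, operations):
        """Apply every operation, or none of them if any one fails."""
        snapshot = copy.deepcopy(self.data)  # state to restore on failure
        try:
            for operation in operations:
                operation(self.data)
        except Exception:
            self.data = snapshot  # roll back the entire transaction
            return False
        return True


def add_gates(data):
    data["gates"] = {"firstname": "Bill", "lastname": "Gates"}


def add_page(data):
    data["page"] = {"firstname": "Larry", "lastname": "Page"}


def failing_write(data):
    raise ValueError("simulated failure in the middle of the transaction")


store = ToyStore()
committed = store.run_atomically([add_gates])
# The second transaction writes "page" and then fails, so "page" is undone.
rolled_back = store.run_atomically([add_page, failing_write])
print(committed, rolled_back, sorted(store.data))
```

A real database achieves the same effect with transaction logs rather than full snapshots; the snapshot is used here only to keep the sketch short.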

The ACID transactional support provides a seamless transition to Neo4j for anyone used to relational databases and offers safety and convenience in working with graph data.

Transactional support is one of the strong points of Neo4j, which differentiates it from the majority of NoSQL solutions and makes it a good option not only for NoSQL enthusiasts but also in enterprise environments. It is also one of the reasons for its popularity in big data scenarios.

Characteristics of Neo4j

Graph databases are built to optimize transactional performance and are engineered to preserve transactional integrity and operational availability. Two properties are useful to understand when investigating graph database technologies:

  • The storage within: Some graph databases store data natively as graphs, in a form optimized by design for storage, queries, and traversals. However, not all graph data stores do this. Some serialize the graph data into an equivalent general-purpose database, such as an object-oriented or relational database.

  • The processing engine: Some definitions of graph databases require the capability for index-free adjacency, which means that connected nodes must physically point to each other in the database. Here, let's take the broader view that any database which, from the user's perspective, behaves like a graph database (that is, exposes a graph data model through CRUD operations) qualifies as a graph database. However, leveraging index-free adjacency brings significant performance advantages for graph data.
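
The difference between index-free adjacency and a lookup through a global structure can be sketched in a few lines. The code below is only an illustration under simplifying assumptions, not Neo4j's storage format: Node, neighbours_via_pointers, and neighbours_via_edge_table are hypothetical names, and the jobs node is an invented extra node added so that page has more than one neighbour.

```python
class Node:
    """A node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.outgoing = []  # direct pointers to adjacent Node objects


def neighbours_via_pointers(node):
    """Index-free adjacency: follow the references stored on the node."""
    return [neighbour.name for neighbour in node.outgoing]


def neighbours_via_edge_table(name, edge_table):
    """Non-native alternative: search a global (start, end) edge table."""
    return [end for start, end in edge_table if start == name]


page, gates, jobs = Node("page"), Node("gates"), Node("jobs")
page.outgoing.extend([gates, jobs])

# The same two relationships stored as rows in a global table.
edge_table = [("page", "gates"), ("page", "jobs")]

print(neighbours_via_pointers(page))
print(neighbours_via_edge_table("page", edge_table))
```

Both functions return the same neighbours. The pointer version touches only the relationships of the node being visited, whereas the table version has to search a structure whose size depends on the whole graph.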

Graph databases, in particular native ones such as Neo4j, don't depend heavily on indexes because the graph itself provides a natural adjacency index. In a native graph database, the relationships attached to a node naturally provide a direct connection to other related nodes of interest. Graph queries largely involve using this locality to traverse through the graph, literally chasing pointers. These operations can be carried out with extreme efficiency, traversing millions of nodes per second, in contrast to joining data through a global index, which is many orders of magnitude slower. There are several different graph data models, including property graphs, hypergraphs, and triples. Let's take a brief look at them:

  • Property graphs: A property graph has the following characteristics:

    • Being a graph, it has nodes and relationships

    • The nodes can possess properties (in the form of key-value pairs)

    • The relationships have a name and direction and must have a start and end node

    • The relationships are also allowed to contain properties

  • Hypergraphs: A hypergraph is a generalized graph model in which a relationship (called hyperedge) can connect any number of nodes. Whereas the property graph model permits a relationship to have only one start node and one end node, the hypergraph model allows any number of nodes at either end of a relationship. Hypergraphs can be useful where the domain consists mainly of many-to-many relationships.

  • Triples: Triple stores come from the Semantic Web movement, where researchers are interested in large-scale knowledge inference by adding semantic markup to the links that connect web resources. To date, very little of the web has been marked up in a useful fashion, so running queries across the semantic layer is uncommon. Instead, most efforts in the Semantic Web movement appear to be invested in harvesting useful data and relationship information from the web (or other more mundane data sources, such as applications) and depositing it in triple stores for querying.
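
As a rough illustration of how the three models differ in shape, the sketch below represents small facts in each of them. All class and function names (PropertyNode, PropertyRelationship, HyperEdge, objects_of) are hypothetical, as are the "origin" property, the COLLABORATES_ON hyperedge, and its placeholder members a, b, x, y, and z.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyNode:
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class PropertyRelationship:
    name: str            # every relationship has a name
    start: PropertyNode  # exactly one start node
    end: PropertyNode    # exactly one end node
    properties: dict = field(default_factory=dict)


@dataclass(frozen=True)
class HyperEdge:
    name: str
    sources: frozenset  # any number of nodes at one end
    targets: frozenset  # any number of nodes at the other end


# Property graph: two nodes joined by one named, directed relationship.
page = PropertyNode({"firstname": "Larry", "lastname": "Page"})
gates = PropertyNode({"firstname": "Bill", "lastname": "Gates"})
works_with = PropertyRelationship("WORKS_WITH", page, gates, {"origin": "example"})

# Hypergraph: a single hyperedge connects several nodes at each end.
shared = HyperEdge("COLLABORATES_ON", frozenset({"a", "b"}), frozenset({"x", "y", "z"}))

# Triples: every fact is a (subject, predicate, object) statement.
triples = [
    ("page", "firstname", "Larry"),
    ("gates", "firstname", "Bill"),
    ("page", "WORKS_WITH", "gates"),
]


def objects_of(subject, predicate):
    """Return the objects of all triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(works_with.start.properties["firstname"], "->", works_with.end.properties["firstname"])
print(len(shared.sources), len(shared.targets))
print(objects_of("page", "WORKS_WITH"))
```

Note that in the triple form, properties and relationships are both just statements, whereas the property graph keeps properties attached to the node or relationship they describe.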

Some essential characteristics of the Neo4j graph database are as follows:

  • It works well with web-based application scenarios including metadata annotations, wikis, social network analysis, data tagging, and other hierarchical datasets.

  • It provides a graph-oriented model along with a visualization framework for the representation of data and query results.

  • Decent documentation and an active, responsive e-mail list are a blessing for developers. It has a few releases behind it and great utility, which suggests that it will be around for a while.

  • Compatible bindings are written for most languages, including Python, Java, Clojure, and Ruby. Bindings for .NET are yet to be written. The REST interface is the recommended approach for access to the database.

  • It natively includes a disk-based storage manager that has been completely optimized to store graphs to provide enhanced performance and scalability. It is also ready for SSDs.

  • It is highly scalable. A single instance of Neo4j can handle graphs containing billions of nodes and relationships.

  • It comes with a powerful traversal framework that is capable of handling speedy traversals in a graph space.

  • It is completely transactional in nature. It is ACID compliant and supports features such as JTA or JTS, 2PC, XA, Transaction Recovery, Deadlock Detection, and so on.

  • It is built to durably handle large graphs that don't fit in memory.

  • Neo4j can traverse graph depths of more than 1,000 levels in a fraction of a second.

The basic CRUD operations

Neo4j stores data in entities called nodes. Nodes are connected to each other by relationships. Both nodes and relationships can store properties or metadata in the form of key-value pairs. The database therefore stores a graph inherently. In this section, we look at the basic CRUD operations used in working with Neo4j:

CREATE (gates { firstname: 'Bill', lastname: 'Gates' })
CREATE (page { firstname: 'Larry', lastname: 'Page' }), (page)-[r:WORKS_WITH]->(gates)
RETURN gates, page, r

In this example, there are two CREATE clauses; the first creates a node that has two properties. The second creates another node in the same way, but also creates a WORKS_WITH relationship from page to gates. The RETURN clause then returns both nodes and the relationship.

START n=node(*) RETURN "The node count of the graph is "+count(*)+" !" as ncount;

A column named ncount is returned with the value "The node count of the graph is 2 !"; it is essentially the equivalent of select count(*) in SQL.

START self=node(1) MATCH self<--friend
RETURN friend

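Assuming that we are using this simple database as an example and that node 1 is the gates node, this query returns the page node, because the pattern respects the direction of the relationship: it matches nodes that have a relationship pointing to self.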

START person=node(*)
MATCH person
WHERE person.firstname! = 'Bill'
RETURN person

This query searches through all nodes and matches the ones whose firstname property is equal to Bill. The ! operator makes the comparison evaluate to false for nodes that lack the property, so such nodes are skipped instead of causing an error.

START person=node(*)
MATCH person
WHERE person.firstname! = 'Bill'
SET person.age = '60'
RETURN person

This query finds the node whose firstname property is Bill and adds another property called age with the value '60' (stored as a string, because the value is quoted).

START person = node(*)
MATCH person
WHERE person.firstname! = "Larry" 
DELETE person

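In this query, we match all nodes that have firstname equal to Larry and perform a delete operation on them. Note that Neo4j does not allow a node to be deleted while it still has relationships, so any relationship attached to the node has to be deleted as well.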

START node = node(*)
MATCH node-[r]-()
DELETE node, r

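This query matches every node that has at least one relationship, together with those relationships, and deletes them. Nodes without any relationship are not matched by the pattern and are therefore left in place.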
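
To make the semantics of the preceding queries concrete, the same create, read, update, and delete sequence can be mirrored in a toy in-memory model. This is a sketch for illustration only and does not use any Neo4j API; ToyGraph and all of its methods are hypothetical names, and node IDs are simply assumed to start at 1.

```python
class ToyGraph:
    """Hypothetical in-memory stand-in for a tiny property graph."""

    def __init__(self):
        self.nodes = {}          # node id -> property dict
        self.relationships = []  # (start id, type, end id) tuples
        self._next_id = 1

    def create_node(self, **properties):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = dict(properties)
        return node_id

    def create_relationship(self, start, rel_type, end):
        self.relationships.append((start, rel_type, end))

    def incoming(self, node_id):
        """Start nodes of relationships pointing at node_id (self<--friend)."""
        return [s for s, _, e in self.relationships if e == node_id]

    def find(self, key, value):
        """Nodes lacking the property never match, like the ! operator."""
        return [i for i, p in self.nodes.items() if key in p and p[key] == value]

    def set_property(self, node_id, key, value):
        self.nodes[node_id][key] = value

    def delete_node(self, node_id):
        # Simplification: the toy removes attached relationships itself.
        self.relationships = [
            r for r in self.relationships if node_id not in (r[0], r[2])
        ]
        del self.nodes[node_id]


graph = ToyGraph()
gates = graph.create_node(firstname="Bill", lastname="Gates")   # create
page = graph.create_node(firstname="Larry", lastname="Page")
graph.create_relationship(page, "WORKS_WITH", gates)

node_count = len(graph.nodes)                                   # read
friends_of_gates = graph.incoming(gates)
bills = graph.find("firstname", "Bill")

graph.set_property(bills[0], "age", "60")                       # update

for node_id in graph.find("firstname", "Larry"):                # delete
    graph.delete_node(node_id)

print(node_count, friends_of_gates, graph.nodes, graph.relationships)
```

The find method checks for the presence of the key before comparing, which is the in-memory analogue of skipping nodes that do not carry the property.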

So, you now know how to perform basic CRUD operations on a Neo4j graph. We will encounter more of these queries in more complex forms in later chapters in the book.