Neo4j High Performance

By: Sonal Raj

Graphs and their utilities


Graphs are a way of representing entities and the connections between them. Mathematically, a graph is a collection of nodes and edges: the nodes are data entities, and the edges denote the relationships between them. In an undirected graph, every edge connects two nodes in both directions, whereas in a directed graph, each edge runs one way from one node to another. We can also attach a value to an edge, which is referred to as the weight of the edge.
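
To make these definitions concrete, here is a minimal sketch of a graph stored as an adjacency list in Python. The `Graph` class, its method names, and the sample people are illustrative assumptions, not something taken from Neo4j or any other library.

```python
from collections import defaultdict


class Graph:
    """A tiny adjacency-list graph; edges carry an optional numeric weight."""

    def __init__(self, directed=False):
        self.directed = directed
        # Maps each node to a dict of {neighbor: weight}.
        self.adjacency = defaultdict(dict)

    def add_edge(self, source, target, weight=1.0):
        self.adjacency[source][target] = weight
        # Register the target as a node even if it has no outgoing edges.
        self.adjacency.setdefault(target, {})
        if not self.directed:
            # An undirected edge can be traversed both ways.
            self.adjacency[target][source] = weight

    def neighbors(self, node):
        return sorted(self.adjacency[node])

    def weight(self, source, target):
        return self.adjacency[source][target]


# Hypothetical data: who knows whom (undirected) and who follows whom (directed).
friends = Graph(directed=False)
friends.add_edge("alice", "bob", weight=3.0)

follows = Graph(directed=True)
follows.add_edge("alice", "bob")

print(friends.neighbors("bob"))  # ['alice']
print(follows.neighbors("bob"))  # []
```

The same edge is reachable from both ends in the undirected graph, but only from its source in the directed one.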

Modern datasets in science, government, and business are diverse and interrelated, yet for years we have been developing data stores that have a tabular schema. When it comes to highly connected data, tabular data stores are slow and complex to work with. So, we started creating data stores that hold data in the same form in which we visualize it. This not only makes it easier to transform our ideas into schemas, but the whiteboard friendliness of such data stores also makes them easy to learn, deploy, and maintain. Over the years, several databases were developed that store their data structurally in the form of graphs. We will look into them in the next section.

Introducing NoSQL databases

Data has been growing in volume, changing more rapidly, and becoming more structurally varied than typical relational databases can handle. Query execution times increase drastically as the size of tables and the number of joins grow. This is because the underlying data models build sets of probable answers to a query before filtering them to arrive at a solution. NoSQL (often interpreted as Not only SQL) provides several alternatives to the relational model.

NoSQL represents a new class of data management technologies designed to meet the increasing volume, velocity, and variety of data that organizations are storing, processing, and analyzing. NoSQL comprises a diverse set of database technologies. It has evolved in response to an exponential increase in the volume of data stored about products, objects, and consumers, to the frequency with which this data is accessed, and to increased processing and performance requirements. Relational databases, by contrast, find it difficult to cope with the rapidly growing scale and agility challenges faced by modern applications, and they struggle to take advantage of the cheap, readily available storage and processing technologies in the market.

Nonrelational databases, often referred to as NoSQL, feature elasticity and scalability. In addition, they can store big data and work with cloud computing systems. All of these factors make them extremely popular. NoSQL databases address needs that the relational model does not, including the following:

  • Large volumes of structure-independent data (including unstructured, semi-structured, and structured data)

  • Agile development sprints, rapid iterations, and frequent repository pushes for the code

  • Flexible, easy-to-use object-oriented programming

  • Efficient scale-out architecture on commodity hardware, as opposed to expensive, monolithic architectures that require specialized hardware

Dynamic schemas

In the case of relational databases, you need to define the schema before you can add your data. In other words, you need to strictly follow a format for all the data you are likely to store in the future. For example, you might store data about consumers such as phone numbers, first and last names, and addresses including the city and state. A SQL database must be told in advance what you are storing, which leaves you little flexibility.

Agile development approaches do not fit well with static schemas, since almost every new feature requires the schema of your database to change. So, if you decide after a few development iterations to store consumers' preferred items along with their contact addresses and phone numbers, that column will need to be added to the existing database, and the complete database will then have to be migrated to the new schema.
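
The difference can be sketched in a few lines of Python. The relational side uses the standard library's `sqlite3` module; the schema-less side is stood in for by a plain list of dictionaries, which is an assumption made for illustration and not the API of any particular NoSQL product. The table name, column names, and sample consumers are likewise hypothetical.

```python
import sqlite3

# Relational side: the schema must exist before any row is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consumers (first_name TEXT, last_name TEXT, phone TEXT)")
conn.execute("INSERT INTO consumers VALUES (?, ?, ?)", ("Ada", "Lovelace", "555-0100"))

# A new feature needs a new attribute, so the table itself has to change first.
conn.execute("ALTER TABLE consumers ADD COLUMN preferred_item TEXT")
conn.execute(
    "INSERT INTO consumers VALUES (?, ?, ?, ?)",
    ("Alan", "Turing", "555-0101", "notebook"),
)
rows = conn.execute(
    "SELECT first_name, preferred_item FROM consumers ORDER BY first_name"
).fetchall()

# Schema-less side: each record simply carries whatever fields it has.
documents = [
    {"first_name": "Ada", "last_name": "Lovelace", "phone": "555-0100"},
    {"first_name": "Alan", "last_name": "Turing", "phone": "555-0101",
     "preferred_item": "notebook"},
]
preferences = [doc.get("preferred_item") for doc in documents]

print(rows)         # [('Ada', None), ('Alan', 'notebook')]
print(preferences)  # [None, 'notebook']
```

SQLite happens to make this particular change cheap on a tiny in-memory table; the point is only that the schema has to be altered before the new field can be stored, whereas the second record in the list needed no such step.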

In the case of a large database, this is a time-consuming process that involves significant downtime, which might adversely affect the business as a whole. If the application data changes frequently due to rapid iterations, this downtime might occur quite often. Businesses sometimes wrongly choose relational databases in situations where completely unstructured data has to be handled effectively or where the structure of the data is unknown in advance. It is also worth noting that while most NoSQL databases support schema or structure changes throughout their lifetime, in some of them, including graph databases, performance can suffer if schema changes are made after a considerably large amount of data has been added.

Automatic sharding

Because of their structure, relational databases usually scale vertically, that is, by increasing the capacity of a single server so that it can host more data while remaining reliable and continuously available. There are limits to such scaling, both in terms of size and expense. An alternative approach is to scale horizontally by increasing the number of machines rather than the capacity of a single machine.

In most relational databases, sharding across multiple server instances is generally accomplished with Storage Area Networks (SANs) and other complicated arrangements that make multiple pieces of hardware act as a single machine. Developers have to manually deploy multiple relational databases across a cluster of machines. The application code distributes the data, distributes the queries, and aggregates the results of the queries from all instances of the database. With manual sharding, handling resource failures, data replication, and load balancing all require customized code.
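
As a rough illustration of what that application code has to do, the following Python sketch routes keys to shards by hash and aggregates a count across all of them. The shards are plain in-memory dictionaries, and the function names (`shard_for`, `put`, `get`, `count_all`) are hypothetical; a real deployment would replace them with connections to separate database instances.

```python
import hashlib

# Hypothetical cluster: each "shard" is just an in-memory dict here.
SHARDS = [{}, {}, {}]


def shard_for(key):
    """Pick a shard from a stable hash of the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def put(key, value):
    shard_for(key)[key] = value


def get(key):
    return shard_for(key).get(key)


def count_all():
    """Scatter-gather: ask every shard, then aggregate in the application."""
    return sum(len(shard) for shard in SHARDS)


for user_id in ("u1", "u2", "u3", "u4", "u5"):
    put(user_id, {"id": user_id})

print(get("u3"))    # {'id': 'u3'}
print(count_all())  # 5
```

Even this toy version leaves out failure handling, replication, and rebalancing. With modulo placement, for instance, changing the number of shards moves most keys to a different shard, which is one reason manual sharding tends to need so much custom code.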

NoSQL databases usually support autosharding out of the box, which means that they natively distribute data across a number of servers and hide this from the application, which is unaware of the composition of the server pool. Data and query load are balanced automatically, and when a node or server fails, the database can quickly replace it without a noticeable drop in performance.

Cloud computing platforms such as Amazon Web Services provide virtually unlimited on-demand capacity. Hence, a pool of commodity servers can now provide the same storage and processing power as a single high-end server for a fraction of the price.

Built-in caching

There are many products available that provide a cache tier for SQL database management systems. They can improve the performance of read operations substantially, but not that of write operations, and they add complexity to the deployment of the system. If read operations dominate the application, then distributed caching can be considered. However, if write operations dominate, or if the application has an even mix of read and write operations, then distributed caching might not be the best choice for a good end user experience.
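
The reason a cache tier helps reads but not writes can be seen in a small read-through cache. The sketch below is an assumption-laden toy: `CachedStore` is a hypothetical class, the backing store is a dictionary, and eviction is plain least-recently-used; it is not modeled on any specific caching product.

```python
from collections import OrderedDict


class CachedStore:
    """Read-through LRU cache in front of a slower backing store (a dict here)."""

    def __init__(self, backing, capacity=2):
        self.backing = backing
        self.capacity = capacity
        self.cache = OrderedDict()
        self.backing_reads = 0
        self.backing_writes = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        self.backing_reads += 1             # cache miss: go to the store
        value = self.backing[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return value

    def write(self, key, value):
        # Every write still reaches the store; the cache only drops its stale copy.
        self.backing_writes += 1
        self.backing[key] = value
        self.cache.pop(key, None)


store = CachedStore({"a": 1, "b": 2})
for _ in range(3):
    store.read("a")
store.write("a", 10)
store.write("a", 11)

print(store.backing_reads)   # 1
print(store.backing_writes)  # 2
print(store.read("a"))       # 11
```

Three reads cost one trip to the store, while two writes cost two trips and also invalidate the cached entry, so the next read misses again.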

Most NoSQL database systems come with built-in caching capabilities that use the system memory to hold the most frequently used data, which does away with the need to maintain a separate caching layer.

Replication

NoSQL databases support automatic replication, which means that you get high availability and failure recovery without the use of specialized applications to manage such operations. From the developer's perspective, the storage environment is essentially virtualized to provide a fault-tolerant experience.
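
The idea can be sketched with a toy store that copies every write to several replicas and serves reads from any replica that is still alive. `ReplicatedStore` and its methods are hypothetical names used only for illustration; real systems also have to deal with consistency between replicas, catching up a recovered node, and coordinating which node accepts writes, all of which this sketch omits.

```python
class ReplicatedStore:
    """Toy replication: writes go to every live replica, reads to any live one."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.alive = [True] * replica_count

    def write(self, key, value):
        for index, replica in enumerate(self.replicas):
            if self.alive[index]:
                replica[key] = value

    def read(self, key):
        for index, replica in enumerate(self.replicas):
            if self.alive[index]:
                return replica.get(key)
        raise RuntimeError("no live replica available")

    def fail(self, index):
        """Simulate the loss of one server."""
        self.alive[index] = False


cluster = ReplicatedStore()
cluster.write("order:1", "shipped")
cluster.fail(0)                 # the first server goes down
print(cluster.read("order:1"))  # shipped
```

The caller never chooses a replica, which is the sense in which the storage environment looks virtualized from the developer's side.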