Book Image

Graph Data Modeling in Python

By : Gary Hutson, Matt Jackson
Book Image

Graph Data Modeling in Python

By: Gary Hutson, Matt Jackson

Overview of this book

Graphs have become increasingly integral to powering the products and services we use in our daily lives, driving social media, online shopping recommendations, and even fraud detection. With this book, you’ll see how a good graph data model can help enhance efficiency and unlock hidden insights through complex network analysis. Graph Data Modeling in Python will guide you through designing, implementing, and harnessing a variety of graph data models using the popular open source Python libraries NetworkX and igraph. Following practical use cases and examples, you’ll find out how to design optimal graph models capable of supporting a wide range of queries and features. Moreover, you’ll seamlessly transition from traditional relational databases and tabular data to the dynamic world of graph data structures that allow powerful, path-based analyses. As well as learning how to manage a persistent graph database using Neo4j, you’ll also get to grips with adapting your network model to evolving data requirements. By the end of this book, you’ll be able to transform tabular data into powerful graph data models. In essence, you’ll build your knowledge from beginner to advanced-level practitioner in no time.
Table of Contents (16 chapters)
Part 1: Getting Started with Graph Data Modeling
Part 2: Making the Graph Transition
Part 3: Storing and Productionizing Graphs
Part 4: Graphing Like a Pro

Comparing RDBs and GDBs

RDBs have been a standard for data storage and data analysis across most industries for a very long time. Their strength lies in being able to hold multiple tables of different information, where some table fields are found across tables, enabling data linkage.

With this data linkage, complex questions can be asked of data in an RDB. However, there are drawbacks to this relational structure. While RDBs are useful for returning a large number of rows that meet particular criteria, they are not suited to questions involving many chained relationships.

To illustrate this, consider a standard database containing train services and their station stops, alongside a graph that might represent the same information:

Figure 1.6 – Relational data structure of trains and their stops

Figure 1.6 – Relational data structure of trains and their stops

In an RDB structure, it would not be difficult to retrieve all trains that service a particular stop. On the other hand, it may be a slow operation that returns a series of trains that can be taken between two chosen stations.

Consider the steps needed in a traditional RDB to find the route between Truro and Glasgow Central in the preceding table. Starting at Truro, we would know the GW1426 train service stops at Truro, Liskeard, and Plymouth. Knowing that these stations can be reached from Truro, we would then need to find what train services stop at each of these stations to find our route.

Upon finding that Plymouth station is reachable and that a separate service runs to many more stations, we would need to repeat this process over and over until Glasgow Central is reached.

These steps essentially result in a series of computationally costly join operations on the same table, where one resulting row would give us the path between our stations of interest.

GDBs to the rescue

Using a graph structure to represent this train network puts greater emphasis on relationships between our data points, as illustrated in the following diagram:


Figure 1.7 – Graph data structure of trains and their stops

Figure 1.7 – Graph data structure of trains and their stops

Using a graph structure to represent this train network puts greater emphasis on relationships between our data points. Starting from Truro station, as in the RDB example, we find the train that services that station. However, when traversing the graph to find a possible route between Truro and Glasgow Central, at each station or train node we are considering fewer data points, and therefore fewer options.

This is in contrast to the RDB example, where repeated table joins are necessary to return a path. In this case, the complexity of the operations required over the graph representation is lower, which equates to a faster, more efficient method. Among many other use cases, those that require some sort of pathfinding often benefit from a graph data model.

In addition to being more suitable for specific types of queries, graphs are typically useful where a flexible, evolving data model is needed. Again, using the example of the train network, imagine that, as the database administrator, you have received a request to add bus transport links to the data model.

With an RDB, a new table would be required, since several bus services would likely serve each train station. In this new table, the names of each station would need to be duplicated from the existing table, to list alongside their related bus services.

Not only does this duplication increase the size of data stored, but it also increases the complexity of the database schema:

Figure 1.8 – Adding a new data type (buses) to the train station graph

Figure 1.8 – Adding a new data type (buses) to the train station graph

Where the train station data is represented with a graph, the new information on buses can be added directly to the existing database as a new node type.

There is no requirement for a new table, and no need to duplicate each station node to represent the required information; the existing train nodes can be directly linked to new Bus nodes. This holds for any new data type that would require the addition of a new table in a traditional RDB.

In a graph, where new data could be represented in an equivalent RDB as a new column in an existing table, this may be a good candidate for a node property, as opposed to a new node type.

Here, an example suitable for being represented as a node property would be a code for each train station, where stations and their codes have a 1-to-1 relationship.

A comparison, in short, is captured in the following:

  • RDBs have a rigid data format and a new table must be added for a new type of data. GDBs are more flexible when it comes to the format of the data and can be extended with new node types.
  • RDBs can be queried via path-based queries – for example, how many steps between two people in a friend network, which involves multiple joins and can be extremely slow as the paths become longer. GDBs query paths directly, with no join operations, so information retrieval is more streamlined and quite frankly faster.

In summary, where the use case for a database concerns querying many relationships between objects – that is, paths – or when a flexible data schema is needed, a graph data model is likely to be a good fit to represent your data.