Graph Data Modeling in Python

By: Gary Hutson, Matt Jackson

Overview of this book

Graphs have become increasingly integral to powering the products and services we use in our daily lives, driving social media, online shopping recommendations, and even fraud detection. With this book, you’ll see how a good graph data model can help enhance efficiency and unlock hidden insights through complex network analysis. Graph Data Modeling in Python will guide you through designing, implementing, and harnessing a variety of graph data models using the popular open source Python libraries NetworkX and igraph. Following practical use cases and examples, you’ll find out how to design optimal graph models capable of supporting a wide range of queries and features. Moreover, you’ll seamlessly transition from traditional relational databases and tabular data to the dynamic world of graph data structures that allow powerful, path-based analyses. As well as learning how to manage a persistent graph database using Neo4j, you’ll also get to grips with adapting your network model to evolving data requirements. By the end of this book, you’ll be able to transform tabular data into powerful graph data models. In essence, you’ll build your knowledge from beginner to advanced-level practitioner in no time.

Comparing RDBs and GDBs

RDBs have been a standard for data storage and data analysis across most industries for a very long time. Their strength lies in being able to hold multiple tables of different information, where some fields are shared across tables, enabling data linkage.

With this data linkage, complex questions can be asked of data in an RDB. However, there are drawbacks to this relational structure. While RDBs are useful for returning a large number of rows that meet particular criteria, they are not suited to questions involving many chained relationships.

To illustrate this, consider a standard database containing train services and their station stops, alongside a graph that might represent the same information:

Figure 1.6 – Relational data structure of trains and their stops

In an RDB structure, it would not be difficult to retrieve all trains that service a particular stop. On the other hand, returning a series of trains that can be taken between two chosen stations may be a slow operation.

Consider the steps needed in a traditional RDB to find the route between Truro and Glasgow Central in the preceding table. Starting at Truro, we would know the GW1426 train service stops at Truro, Liskeard, and Plymouth. Knowing that these stations can be reached from Truro, we would then need to find what train services stop at each of these stations to find our route.

Upon finding that Plymouth station is reachable and that a separate service runs to many more stations, we would need to repeat this process over and over until Glasgow Central is reached.

These steps essentially result in a series of computationally costly join operations on the same table, where one resulting row would give us the path between our stations of interest.
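The chained joins described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the book's own code: the GW1426 service and its three stops come from the text, while the second service code (XC9999), its stops, and the table layout are assumptions made up for the example. The query also ignores stop order and timetables.

```python
import sqlite3

# Illustrative data only. GW1426 and its stops are taken from the text;
# XC9999 and its stops are hypothetical, invented for this sketch.
stops = [
    ("GW1426", "Truro"),
    ("GW1426", "Liskeard"),
    ("GW1426", "Plymouth"),
    ("XC9999", "Plymouth"),
    ("XC9999", "Bristol Temple Meads"),
    ("XC9999", "Glasgow Central"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stops (service TEXT, station TEXT)")
conn.executemany("INSERT INTO stops VALUES (?, ?)", stops)

# A route with exactly one change already needs four copies of the table:
#   a = origin stop, b = change station on the same service,
#   c = the same station on a different service, d = destination stop.
one_change_sql = """
SELECT a.service, b.station, c.service
FROM stops AS a
JOIN stops AS b ON b.service = a.service
JOIN stops AS c ON c.station = b.station AND c.service <> a.service
JOIN stops AS d ON d.service = c.service
WHERE a.station = ? AND d.station = ?
"""
routes = conn.execute(one_change_sql, ("Truro", "Glasgow Central")).fetchall()
print(routes)  # (first service, change station, second service)
```

In this layout, every additional change of train adds two more self-joins to the query, which is why longer routes become expensive to compute.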

GDBs to the rescue

Using a graph structure to represent this train network puts greater emphasis on relationships between our data points, as illustrated in the following diagram:

Figure 1.7 – Graph data structure of trains and their stops

Starting from Truro station, as in the RDB example, we find the train that services that station. However, when traversing the graph to find a possible route between Truro and Glasgow Central, we consider fewer data points at each station or train node, and therefore fewer options.

This is in contrast to the RDB example, where repeated table joins are necessary to return a path. In this case, the complexity of the operations required over the graph representation is lower, which equates to a faster, more efficient method. Among many other use cases, those that require some sort of pathfinding often benefit from a graph data model.
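As a rough sketch of this kind of traversal, the following uses a plain Python dictionary as the graph and a breadth-first search to find a route. It deliberately avoids NetworkX and igraph to stay dependency-free; in NetworkX, `nx.shortest_path` plays a similar role. The XC9999 service and its stops are hypothetical, and the `shortest_path` helper here is an illustrative name, not a library function.

```python
from collections import deque

# Station and train nodes joined by "stops at" edges.
# GW1426 is from the text; XC9999 and its stops are made up.
edges = [
    ("Truro", "GW1426"),
    ("Liskeard", "GW1426"),
    ("Plymouth", "GW1426"),
    ("Plymouth", "XC9999"),
    ("Bristol Temple Meads", "XC9999"),
    ("Glasgow Central", "XC9999"),
]

# Build an undirected adjacency mapping: node -> set of neighbours.
graph = {}
for station, service in edges:
    graph.setdefault(station, set()).add(service)
    graph.setdefault(service, set()).add(station)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        # Only the neighbours of the current node are examined.
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


path = shortest_path(graph, "Truro", "Glasgow Central")
print(path)
```

At each step, the search looks only at the neighbours of the current node, rather than joining against every row of a table.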

In addition to being more suitable for specific types of queries, graphs are typically useful where a flexible, evolving data model is needed. Again, using the example of the train network, imagine that, as the database administrator, you have received a request to add bus transport links to the data model.

With an RDB, a new table would be required, since several bus services would likely serve each train station. In this new table, the names of each station would need to be duplicated from the existing table, to list alongside their related bus services.

Not only does this duplication increase the size of data stored, but it also increases the complexity of the database schema:

Figure 1.8 – Adding a new data type (buses) to the train station graph

Where the train station data is represented with a graph, the new information on buses can be added directly to the existing database as a new node type.

There is no requirement for a new table, and no need to duplicate each station node to represent the required information; the existing station nodes can be directly linked to new Bus nodes. This holds for any new data type that would require the addition of a new table in a traditional RDB.

Where new data would be represented in an equivalent RDB as a new column in an existing table, it may be a good candidate for a node property in the graph, as opposed to a new node type.

Here, an example suitable for being represented as a node property would be a code for each train station, where stations and their codes have a 1-to-1 relationship.
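The difference between adding a new node type and adding a node property can be sketched with plain Python dictionaries. This is only an illustration under assumed names: the "Bus 42" node, the `type` and `code` keys, and the station codes shown are placeholders invented for the example, not data from the book.

```python
# Existing graph: node id -> property dict, plus a set of edges.
nodes = {
    "Truro": {"type": "Station"},
    "Plymouth": {"type": "Station"},
    "GW1426": {"type": "Train"},
}
edges = {("GW1426", "Truro"), ("GW1426", "Plymouth")}

# New data type (buses): add a new kind of node and link it to a node
# that already exists. No station is duplicated. "Bus 42" is hypothetical.
nodes["Bus 42"] = {"type": "Bus"}
edges.add(("Bus 42", "Plymouth"))

# New 1-to-1 attribute (station codes): store it as a property on the
# existing nodes instead of creating a node type. Codes are placeholders.
station_codes = {"Truro": "TRU", "Plymouth": "PLY"}
for station, code in station_codes.items():
    nodes[station]["code"] = code

# The number of station nodes is unchanged by either addition.
station_count = sum(1 for props in nodes.values() if props["type"] == "Station")
print(station_count, nodes["Plymouth"])
```

Both changes extend the existing structure in place, which is the flexibility the text attributes to graph data models.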

A comparison, in short, is captured in the following:

RDBs have a rigid data format, and a new table must be added for each new type of data. GDBs are more flexible when it comes to the format of the data and can be extended with new node types.

RDBs can answer path-based queries (for example, how many steps separate two people in a friend network), but doing so involves multiple joins and can become extremely slow as the paths grow longer. GDBs query paths directly, with no join operations, so information retrieval is more streamlined and typically faster.

In summary, where the use case for a database concerns querying many relationships between objects – that is, paths – or when a flexible data schema is needed, a graph data model is likely to be a good fit to represent your data.
