Book Image

Graph Machine Learning

By : Claudio Stamile, Aldo Marzullo, Enrico Deusebio
5 (1)
Book Image

Graph Machine Learning

5 (1)
By: Claudio Stamile, Aldo Marzullo, Enrico Deusebio

Overview of this book

Graph Machine Learning will introduce you to a set of tools used for processing network data and leveraging the power of the relation between entities that can be used for predictive, modeling, and analytics tasks. The first chapters will introduce you to graph theory and graph machine learning, as well as the scope of their potential use. You’ll then learn all you need to know about the main machine learning models for graph representation learning: their purpose, how they work, and how they can be implemented in a wide range of supervised and unsupervised learning applications. You'll build a complete machine learning pipeline, including data processing, model training, and prediction in order to exploit the full potential of graph data. After covering the basics, you’ll be taken through real-world scenarios such as extracting data from social networks, text analytics, and natural language processing (NLP) using graphs and financial transaction systems on graphs. You’ll also learn how to build and scale out data-driven applications for graph analytics to store, query, and process network information, and explore the latest trends on graphs. By the end of this machine learning book, you will have learned essential concepts of graph theory and all the algorithms and techniques used to build successful machine learning applications.
Table of Contents (15 chapters)
1
Section 1 – Introduction to Graph Machine Learning
4
Section 2 – Machine Learning on Graphs
8
Section 3 – Advanced Applications of Graph Machine Learning

Introduction to graphs with networkx

In this section, we will give a general introduction to graph theory. Moreover, in order to merge theoretical concepts with their practical implementation, we will enrich our explanation with code snippets in Python, using networkx.

A simple undirected graph (or simply, a graph) G is defined as a couple G=(V,E) , where V={, .., } is a set of nodes (also called vertices) and E={{, .., {,}} is a set of two-sets (set of two elements) of edges (also called links), representing the connection between two nodes belonging to V.

It is important to underline that since each element of E is a two-set, there is no order between each edge. To provide more detail, {, and {, represent the same edge.

We now provide definitions for some basic properties of graphs and nodes, as follows:

  • The order of a graph is the number of its vertices |V|. The size of a graph is the number of its edges |E|.
  • The degree of a vertex is the number of edges that are adjacent to it. The neighbors of a vertex v in a graph G is a subset of vertex induced by all vertices adjacent to v.
  • The neighborhood graph (also known as an ego graph) of a vertex v in a graph G is a subgraph of G, composed of the vertices adjacent to v and all edges connecting vertices adjacent to v.

An example of what a graph looks like can be seen in the following screenshot:

Figure 1.1 – Example of a graph

Figure 1.1 – Example of a graph

According to this representation, since there is no direction, an edge from Milan to Paris is equal to an edge from Paris to Milan. Thus, it is possible to move in the two directions without any constraint. If we analyze the properties of the graph depicted in Figure 1.1, we can see that it has order and size equal to 4 (there are, in total, four vertices and four edges). The Paris and Dublin vertices have degree 2, Milan has degree 3, and Rome has degree 1. The neighbors for each node are shown in the following list:

  • Paris = {Milan, Dublin}
  • Milan = {Paris, Dublin, Rome}
  • Dublin = {Paris, Milan}
  • Rome = {Milan}

The same graph can be represented in networkx, as follows:

import networkx as nx
G = nx.Graph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Milan','Paris'), ('Paris','Dublin'), ('Milan','Rome')]
G.add_nodes_from(V)
G.add_edges_from(E)

Since by default, the nx.Graph() command generates an undirected graph, we do not need to specify both directions of each edge. In networkx, nodes can be any hashable object: strings, classes, or even other networkx graphs. Let's now compute some properties of the graph we previously generated.

All the nodes and edges of the graph can be obtained by running the following code:

print(f"V = {G.nodes}")
print(f"E = {G.edges}")

Here is the output of the previous commands:

V = ['Rome', 'Dublin', 'Milan', 'Paris']
E = [('Rome', 'Milan'), ('Dublin', 'Milan'), ('Dublin', 'Paris'), ('Milan', 'Paris')]

We can also compute the graph order, the graph size, and the degree and neighbors for each of the nodes, using the following commands:

print(f"Graph Order: {G.number_of_nodes()}")
print(f"Graph Size: {G.number_of_edges()}")
print(f"Degree for nodes: { {v: G.degree(v) for v in G.nodes} }")
print(f"Neighbors for nodes: { {v: list(G.neighbors(v)) for v in G.nodes} }") 

The result will be the following:

Graph Order: 4
Graph Size: 4
Degree for nodes: {'Rome': 1, 'Paris': 2, 'Dublin':2, 'Milan': 3}
Neighbors for nodes: {'Rome': ['Milan'], 'Paris': ['Milan', 'Dublin'], 'Dublin': ['Milan', 'Paris'], 'Milan': ['Dublin', 'Paris', 'Rome']}

Finally, we can also compute an ego graph of a specific node for the graph G, as follows:

ego_graph_milan = nx.ego_graph(G, "Milan")
print(f"Nodes: {ego_graph_milan.nodes}")
print(f"Edges: {ego_graph_milan.edges}")

The result will be the following:

Nodes: ['Paris', 'Milan', 'Dublin', 'Rome']
Edges: [('Paris', 'Milan'), ('Paris', 'Dublin'), ('Milan', 'Dublin'), ('Milan', 'Rome')]

The original graph can be also modified by adding new nodes and/or edges, as follows:

#Add new nodes and edges
new_nodes = {'London', 'Madrid'}
new_edges = [('London','Rome'), ('Madrid','Paris')]
G.add_nodes_from(new_nodes)
G.add_edges_from(new_edges)
print(f"V = {G.nodes}")
print(f"E = {G.edges}")

This would output the following lines:

V = ['Rome', 'Dublin', 'Milan', 'Paris', 'London', 'Madrid']
E = [('Rome', 'Milan'), ('Rome', 'London'), ('Dublin', 'Milan'), ('Dublin', 'Paris'), ('Milan', 'Paris'), ('Paris', 'Madrid')]

Removal of nodes can be done by running the following code:

node_remove = {'London', 'Madrid'}
G.remove_nodes_from(node_remove)
print(f"V = {G.nodes}")
print(f"E = {G.edges}")

This is the result of the preceding commands:

V = ['Rome', 'Dublin', 'Milan', 'Paris']
E = [('Rome', 'Milan'), ('Dublin', 'Milan'), ('Dublin', 'Paris'), ('Milan', 'Paris')]

As expected, all the edges that contain the removed nodes are automatically deleted from the edge list.

Also, edges can be removed by running the following code:

node_edges = [('Milan','Dublin'), ('Milan','Paris')]
G.remove_edges_from(node_edges)
print(f"V = {G.nodes}")
print(f"E = {G.edges}")

The final result will be as follows:

V = ['Dublin', 'Paris', 'Milan', 'Rome']
E = [('Dublin', 'Paris'), ('Milan', 'Rome')]

The networkx library also allows us to remove a single node or a single edge from a graph G by using the following commands: G. remove_node('Dublin') and G.remove_edge('Dublin', 'Paris').

Types of graphs

In the previous section, we described how to create and modify simple undirected graphs. Here, we will show how we can extend this basic data structure in order to encapsulate more information, thanks to the introduction of directed graphs (digraphs), weighted graphs, and multigraphs.

Digraphs

A digraph G is defined as a couple G=(V, E), where V={, .., } is a set of nodes and E={(, .., (,)} is a set of ordered couples representing the connection between two nodes belonging to V.

Since each element of E is an ordered couple, it enforces the direction of the connection. The edge , means the node goes into . This is different from , since it means the node goes to .The starting node is called the head, while the ending node is called the tail.

Due to the presence of edge direction, the definition of node degree needs to be extended.

Indegree and outdegree

For a vertex v, the number of head ends adjacent to v is called the indegree (indicated by of v, while the number of tail ends adjacent to v is its outdegree (indicated by ).

An example of what a digraph looks like is available in the following screenshot:

Figure 1.2 – Example of a digraph

Figure 1.2 – Example of a digraph

The direction of the edge is visible from the arrow—for example, Milan -> Dublin means from Milan to Dublin. Dublin has = 2 and = 0, Paris has = 0 and = 2, Milan has = 1 and = 2, and Rome has = 1 and = 0.

The same graph can be represented in networkx, as follows:

G = nx.DiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Paris','Milan'), ('Paris','Dublin'), ('Milan','Rome')]
G.add_nodes_from(V)
G.add_edges_from(E)

The definition is the same as that used for simple undirected graphs; the only difference is in the networkx classes that are used to instantiate the object. For digraphs, the nx.DiGraph()class is used.

Indegree and Outdegree can be computed using the following commands:

print(f"Indegree for nodes: { {v: G.in_degree(v) for v in G.nodes} }")
print(f"Outdegree for nodes: { {v: G.out_degree(v) for v in G.nodes} }")

The results will be as follows:

Indegree for nodes: {'Rome': 1, 'Paris': 0, 'Dublin': 2, 'Milan': 1}
Outdegree for nodes: {'Rome': 0, 'Paris': 2, 'Dublin': 0, 'Milan': 2}

As for the undirected graphs, G.add_nodes_from(), G.add_edges_from(), G.remove_nodes_from(), and G.remove_edges_from() functions can be used to modify a given graph G.

Multigraph

We will now introduce the multigraph object, which is a generalization of the graph definition that allows multiple edges to have the same pair of start and end nodes.

A multigraph G is defined as G=(V, E), where V is a set of nodes and E is a multi-set (a set allowing multiple instances for each of its elements) of edges.

A multigraph is called a directed multigraph if E is a multi-set of ordered couples; otherwise, if E is a multi-set of two-sets, then it is called an undirected multigraph.

An example of a directed multigraph is available in the following screenshot:

Figure 1.3 – Example of a multigraph

Figure 1.3 – Example of a multigraph

In the following code snippet, we show how to use networkx in order to create a directed or an undirected multigraph:

directed_multi_graph = nx.MultiDiGraph()
undirected_multi_graph = nx.MultiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin'), ('Milan','Dublin'), ('Paris','Milan'), ('Paris','Dublin'), ('Milan','Rome'), ('Milan','Rome')]
directed_multi_graph.add_nodes_from(V)
undirected_multi_graph.add_nodes_from(V)
directed_multi_graph.add_edges_from(E)
undirected_multi_graph.add_edges_from(E)

The only difference between a directed and an undirected multigraph is in the first two lines, where two different objects are created: nx.MultiDiGraph() is used to create a directed multigraph, while nx.MultiGraph() is used to build an undirected multigraph. The function used to add nodes and edges is the same for both objects.

Weighted graphs

We will now introduce directed, undirected, and multi-weighted graphs.

An edge-weighted graph (or simply, a weighted graph) G is defined as G=(V, E ,w) where V is a set of nodes, E is a set of edges, and is the weighted function that assigns at each edge a weight expressed as a real number.

A node-weighted graph G is defined as G=(V, E ,w) ,where V is a set of nodes, E is a set of edges, and is the weighted function that assigns at each node a weight expressed as a real number.

Please keep the following points in mind:

  • If E is a set of ordered couples, then we call it a directed weighted graph.
  • If E is a set of two-sets, then we call it an undirected weighted graph.
  • If E is a multi-set, we will call it a weighted multigraph (directed weighted multigraph).
  • If E is a multi-set of ordered couples, it is an undirected weighted multigraph.

An example of a directed edge-weighted graph is available in the following screenshot:

Figure 1.4 – Example of a directed edge-weighted graph

Figure 1.4 – Example of a directed edge-weighted graph

From Figure 1.4, it is easy to see how the presence of weights on graphs helps to add useful information to the data structures. Indeed, we can imagine the edge weight as a "cost" to reach a node from another node. For example, reaching Dublin from Milan has a "cost" of 19, while reaching Dublin from Paris has a "cost" of 11.

In networkx, a directed weighted graph can be generated as follows:

G = nx.DiGraph()
V = {'Dublin', 'Paris', 'Milan', 'Rome'}
E = [('Milan','Dublin', 19), ('Paris','Milan', 8), ('Paris','Dublin', 11), ('Milan','Rome', 5)]
G.add_nodes_from(V)
G.add_weighted_edges_from(E)

Bipartite graphs

We will now introduce another type of graph that will be used in this section: multipartite graphs. Bi- and tripartite graphs—and, more generally, kth-partite graphs—are graphs whose vertices can be partitioned in two, three, or more k-th sets of nodes, respectively. Edges are only allowed across different sets and are not allowed within nodes belonging to the same set. In most cases, nodes belonging to different sets are also characterized by particular node types. In Chapters 7, Text Analytics and Natural Language Processing Using Graphs, and Chapter 8, Graphs Analysis for Credit Cards Transaction, we will deal with some practical examples of graph-based applications and you will see how multipartite graphs can indeed arise in several contexts—for example, in the following scenarios:

  • When processing documents and structuring the information in a bipartite graph of documents and entities that appear in the documents
  • When dealing with transactional data, in order to encode the relations between the buyers and the merchants

A bipartite graph can be easily created in networkx with the following code:

import pandas as pd
import numpy as np
n_nodes = 10
n_edges = 12
bottom_nodes = [ith for ith in range(n_nodes) if ith % 2 ==0]
 top_nodes = [ith for ith in range(n_nodes) if ith % 2 ==1]
iter_edges = zip(
    np.random.choice(bottom_nodes, n_edges),  
    np.random.choice(top_nodes, n_edges))
edges = pd.DataFrame([
    {"source": a, "target": b} for a, b in iter_edges])
B = nx.Graph()
B.add_nodes_from(bottom_nodes, bipartite=0)
 B.add_nodes_from(top_nodes, bipartite=1)
 B.add_edges_from([tuple(x) for x in edges.values])

The network can also be conveniently plotted using the bipartite_layout utility function of networkx, as illustrated in the following code snippet:

from networkx.drawing.layout import bipartite_layout
pos = bipartite_layout(B, bottom_nodes)
 nx.draw_networkx(B, pos=pos)

The bipatite_layout function produces a graph, as shown in the following screenshot:

Figure 1.5 – Example of a bipartite graph

Figure 1.5 – Example of a bipartite graph

Graph representations

As described in the previous sections, with networkx, we can actually define and manipulate a graph by using node and edge objects. In different use cases, such a representation would not be as easy to handle. In this section, we will show two ways to perform a compact representation of a graph data structure—namely, an adjacency matrix and an edge list.

Adjacency matrix

The adjacency matrix M of a graph G=(V,E) is a square matrix (|V| × |V|) matrix such that its element is 1 when there is an edge from node i to node j, and 0 when there is no edge. In the following screenshot, we show a simple example where the adjacency matrix of different types of graphs is displayed:

Figure 1.6 – Adjacency matrix for an undirected graph, a digraph, a multigraph, and a weighted graph

Figure 1.6 – Adjacency matrix for an undirected graph, a digraph, a multigraph, and a weighted graph

It is easy to see that adjacency matrices for undirected graphs are always symmetric, since no direction is defined for the edge. The symmetry instead is not guaranteed for the adjacency matrix of a digraph due to the presence of constraints in the direction of the edges. For a multigraph, we can instead have values greater than 1 since multiple edges can be used to connect the same couple of nodes. For a weighted graph, the value in a specific cell is equal to the weight of the edge connecting the two nodes.

In networkx, the adjacency matrix for a given graph can be computed in two different ways. If G is the networkx of Figure 1.6, we can compute its adjacency matrix as follows:

nx.to_pandas_adjacency(G) #adjacency matrix as pd DataFrame
nt.to_numpy_matrix(G) #adjacency matrix as numpy matrix

For the first and second line, we get the following results respectively:

          Rome  Dublin  Milan  Paris
Rome     0.0     0.0    0.0    0.0
Dublin   0.0     0.0    0.0    0.0
Milan    1.0     1.0    0.0    0.0
Paris    0.0     1.0    1.0    0.0
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [1. 1. 0. 0.]
 [0. 1. 1. 0.]]

Since a numpy matrix cannot represent the name of the nodes, the order of the element in the adjacency matrix is the one defined in the G.nodes list.

Edge list

As well as an adjacency matrix, an edge list is another compact way to represent graphs. The idea behind this format is to represent a graph as a list of edges.

The edge list L of a graph G=(V,E) is a list of size |E| matrix such that its element is a couple representing the tail and the end node of the edge i. An example of the edge list for each type of graph is available in the following screenshot:

Figure 1.7 – Edge list for an undirected graph, a digraph, a multigraph, and a weighted graph

Figure 1.7 – Edge list for an undirected graph, a digraph, a multigraph, and a weighted graph

In the following code snippet, we show how to compute in networkx the edge list of the simple undirected graph G available in Figure 1.7:

print(nx.to_pandas_edgelist(G))

By running the preceding command, we get the following result:

  source  target
0  Milan  Dublin
1  Milan    Rome
2  Paris   Milan
3  Paris  Dublin

Other representation methods, which we will not discuss in detail, are also available in networkx. Some examples are nx.to_dict_of_dicts(G) and nx.to_numpy_array(G), among others.