Mastering Probabilistic Graphical Models with Python

By: Ankur Ankan

Conditional probability distribution


Let's take an example to understand conditional probability better. Say we have a bag containing three apples and five oranges, and we randomly take fruits out of the bag one at a time without replacing them. Let the random variables X1 and X2 represent the outcomes of the first and second try, respectively. As there are three apples and five oranges in the bag initially, P(X1 = apple) = 3/8 and P(X1 = orange) = 5/8. Now, let's say that we got an orange in our first attempt. We cannot simply state a single probability of getting an apple or an orange in our second attempt, because the probabilities in the second attempt depend on the outcome of the first; we use conditional probability to represent such cases. In the second attempt, we have the following probabilities, which depend on the outcome of our first try: P(X2 = apple | X1 = orange) = 3/7, P(X2 = orange | X1 = orange) = 4/7, P(X2 = apple | X1 = apple) = 2/7, and P(X2 = orange | X1 = apple) = 5/7.
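The numbers in this example can be checked by simple counting. The following is a minimal sketch using Python's standard fractions module; the helper name draw_probabilities is an illustrative assumption and not part of any library.

```python
from fractions import Fraction


def draw_probabilities(apples, oranges):
    """Return the probability of each fruit for one random draw."""
    total = apples + oranges
    return {"apple": Fraction(apples, total),
            "orange": Fraction(oranges, total)}


# First draw: three apples and five oranges are in the bag.
first = draw_probabilities(3, 5)

# Second draw, conditioned on what was removed in the first draw.
second_given_orange = draw_probabilities(3, 4)  # one orange removed
second_given_apple = draw_probabilities(2, 5)   # one apple removed

print("first draw:", first)
print("second draw | first was orange:", second_given_orange)
print("second draw | first was apple:", second_given_apple)
```

Each dictionary sums to 1, because each one is a complete distribution over the two possible fruits for a fixed outcome of the first draw.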

The conditional probability distribution (CPD) of two variables X1 and X2 can be represented as P(X1 | X2), the probability of X1 given X2, that is, the probability of X1 after the event X2 has occurred and we know its outcome. Similarly, we can have P(X2 | X1), representing the probability of X2 after having an observation for X1.
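Both directions of conditioning can be worked out for the bag example. The sketch below writes X1 and X2 for the first and second draws (labels assumed here for illustration), builds the joint distribution from P(X1) and P(X2 | X1), and then derives P(X1 | X2) from the standard definition P(X1 | X2) = P(X1, X2) / P(X2). All variable names are hypothetical.

```python
from fractions import Fraction

# P(X1): distribution of the first draw (3 apples, 5 oranges).
p_x1 = {"apple": Fraction(3, 8), "orange": Fraction(5, 8)}

# P(X2 | X1): outer key is the first draw, inner key the second draw.
p_x2_given_x1 = {
    "apple": {"apple": Fraction(2, 7), "orange": Fraction(5, 7)},
    "orange": {"apple": Fraction(3, 7), "orange": Fraction(4, 7)},
}

# Joint distribution: P(X1, X2) = P(X1) * P(X2 | X1).
joint = {(x1, x2): p_x1[x1] * p_x2_given_x1[x1][x2]
         for x1 in p_x1 for x2 in p_x2_given_x1[x1]}

# Marginal of the second draw: P(X2) = sum over X1 of P(X1, X2).
p_x2 = {x2: sum(joint[(x1, x2)] for x1 in p_x1)
        for x2 in ("apple", "orange")}

# Conditioning the other way: P(X1 | X2) = P(X1, X2) / P(X2).
p_x1_given_x2 = {x2: {x1: joint[(x1, x2)] / p_x2[x2] for x1 in p_x1}
                 for x2 in p_x2}

print(p_x2["apple"])                     # 3/8
print(p_x1_given_x2["apple"]["orange"])  # 5/7
```

Note that P(X2 = apple) equals P(X1 = apple): before anything is observed, the second draw is as likely to be an apple as the first.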

The simplest representation of a CPD is a tabular CPD. In a tabular CPD, we construct a table containing all the possible combinations of the states of the random variables and the probabilities corresponding to these combinations. Let's consider the earlier restaurant example.

Let's begin by representing the marginal distribution of the quality of food with Q. As we mentioned earlier, it can take one of three values: {good, normal, bad}. For example, P(Q) can be represented in tabular form as follows:

Quality    P(Q)
Good       0.3
Normal     0.5
Bad        0.2

Similarly, let's say P(L) is the probability distribution of the location of the restaurant. Its CPD can be represented as follows:

Location   P(L)
Good       0.6
Bad        0.4

As the cost of the restaurant, C, depends on both the quality of food Q and its location L, we will be considering P(C | Q, L), which is the conditional distribution of C given Q and L:

Location      Good    Good    Good    Bad     Bad     Bad
Quality       Good    Normal  Bad     Good    Normal  Bad
Cost = High   0.8     0.6     0.1     0.6     0.6     0.05
Cost = Low    0.2     0.4     0.9     0.4     0.4     0.95
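Before moving to pgmpy, the three tables can be written as plain Python dictionaries to check two things: every column of P(C | Q, L) sums to 1, and the marginal P(C) follows by summing over the parents. This is only a sketch; the dictionary layout and names are illustrative, and the marginal computation assumes that Q and L are independent of each other, so that P(Q, L) = P(Q) P(L).

```python
# Marginal distributions copied from the tables above.
p_quality = {"Good": 0.3, "Normal": 0.5, "Bad": 0.2}
p_location = {"Good": 0.6, "Bad": 0.4}

# P(C | Q, L), keyed by (location, quality) -> {cost: probability}.
p_cost = {
    ("Good", "Good"): {"High": 0.8, "Low": 0.2},
    ("Good", "Normal"): {"High": 0.6, "Low": 0.4},
    ("Good", "Bad"): {"High": 0.1, "Low": 0.9},
    ("Bad", "Good"): {"High": 0.6, "Low": 0.4},
    ("Bad", "Normal"): {"High": 0.6, "Low": 0.4},
    ("Bad", "Bad"): {"High": 0.05, "Low": 0.95},
}

# Each column is a distribution over Cost, so it must sum to 1.
for parents, column in p_cost.items():
    assert abs(sum(column.values()) - 1.0) < 1e-9, parents

# P(C) = sum over Q, L of P(C | Q, L) * P(Q) * P(L).
# ASSUMPTION: Q and L are independent.
p_cost_marginal = {"High": 0.0, "Low": 0.0}
for (loc, qual), column in p_cost.items():
    weight = p_location[loc] * p_quality[qual]
    for cost, prob in column.items():
        p_cost_marginal[cost] += weight * prob

print(p_cost_marginal)
```

Under that independence assumption, the marginal probability of a high cost works out to 0.532 and of a low cost to 0.468.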

Representing CPDs using pgmpy

Let's first see how to represent a tabular CPD using pgmpy for variables that are not conditioned on any other variables:

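# Note: in recent pgmpy releases, TabularCPD is imported from
# pgmpy.factors.discrete rather than pgmpy.factors.
In [1]: from pgmpy.factors import TabularCPD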

# To create a TabularCPD object, we need to pass three arguments:
# the variable name, its cardinality (the number of states of the
# random variable), and the probability values corresponding to
# each state.
In [2]: quality = TabularCPD(variable='Quality',
                             variable_card=3,
                             values=[[0.3], [0.5], [0.2]])
In [3]: print(quality)
╒════════════════╤═════╕
│ ['Quality', 0] │ 0.3 │
├────────────────┼─────┤
│ ['Quality', 1] │ 0.5 │
├────────────────┼─────┤
│ ['Quality', 2] │ 0.2 │
╘════════════════╧═════╛
In [4]: quality.variables
Out[4]: OrderedDict([('Quality', [State(var='Quality', state=0), 
                                  State(var='Quality', state=1), 
                                  State(var='Quality', state=2)])])

In [5]: quality.cardinality
Out[5]: array([3])

In [6]: quality.values
Out[6]: array([0.3, 0.5, 0.2])

You can see here that the values of the CPD are stored as a 1D array rather than the 2D array that you passed as an argument. pgmpy internally stores the values of a TabularCPD as a flattened NumPy array. We will see the reason for this in the next chapter.
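To make the flattening concrete, here is a small pure-Python sketch of row-major flattening, which is the order NumPy uses by default when it flattens an array. The helper flatten_row_major is a hypothetical name used only for illustration; it is not part of pgmpy or NumPy.

```python
def flatten_row_major(table):
    """Concatenate the rows of a 2D list into a single 1D list."""
    return [value for row in table for value in row]


# The 3x1 table passed for Quality becomes a list of three values.
quality_values = [[0.3], [0.5], [0.2]]
print(flatten_row_major(quality_values))  # [0.3, 0.5, 0.2]

# For a table with several columns, the entry at (row, col) lands
# at flat index row * n_cols + col.
table = [[0.8, 0.6, 0.1],
         [0.2, 0.4, 0.9]]
flat = flatten_row_major(table)
n_cols = len(table[0])
print(flat[1 * n_cols + 2] == table[1][2])  # True
```

Because the mapping between a table position and a flat index is a simple formula, no information is lost by storing the values in one dimension.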

In [7]: location = TabularCPD(variable='Location',
                              variable_card=2,
                              values=[[0.6], [0.4]])
In [8]: print(location)
╒═════════════════╤═════╕
│ ['Location', 0] │ 0.6 │
├─────────────────┼─────┤
│ ['Location', 1] │ 0.4 │
╘═════════════════╧═════╛

However, when we have conditional variables, we also need to specify them and the cardinality of those variables. Let's define the TabularCPD for the cost variable:

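# The columns of `values` follow the table above: Location varies
# slowest and Quality varies fastest, so 'L' is listed before 'Q'.
In [9]: cost = TabularCPD(
                      variable='Cost',
                      variable_card=2,
                      values=[[0.8, 0.6, 0.1, 0.6, 0.6, 0.05],
                              [0.2, 0.4, 0.9, 0.4, 0.4, 0.95]],
                      evidence=['L', 'Q'],
                      evidence_card=[2, 3])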
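The order of the columns in values has to agree with the order of the variables in evidence. To our understanding, pgmpy lays the columns out so that the last evidence variable changes fastest. The following sketch enumerates the parent-state combination behind each column under that convention, using only the standard library; the helper column_labels and the state names are illustrative assumptions, since the CPD itself refers to states by integer index.

```python
from itertools import product


def column_labels(evidence, states):
    """List the parent-state combination for each column, with the
    last evidence variable cycling fastest."""
    return list(product(*(states[name] for name in evidence)))


# Hypothetical state names, in the same order as in the tables.
states = {"L": ["Good", "Bad"], "Q": ["Good", "Normal", "Bad"]}

# First row of the values argument: P(Cost = High | ...).
high = [0.8, 0.6, 0.1, 0.6, 0.6, 0.05]

for (loc, qual), prob in zip(column_labels(["L", "Q"], states), high):
    print(f"L={loc:<4} Q={qual:<6} P(Cost=High)={prob}")
```

For the cost table, listing Location before Quality reproduces the column layout of the table: the first three columns belong to a good location and the last three to a bad one.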