Implementing temporal difference learning

This recipe will walk you through how to implement the temporal difference (TD) learning algorithm. TD algorithms allow us to incrementally learn from incomplete episodes of agent experiences, which means they can be used for problems that require online learning capabilities. TD algorithms are useful in model-free RL settings as they do not depend on a model of the MDP transitions or rewards. To visually understand the learning progression of the TD algorithm, this recipe will also show you how to implement the GridworldV2 learning environment, which looks as follows when rendered:

Figure 2.6 – The GridworldV2 learning environment 2D rendering with state values and grid cell coordinates

Getting ready

To complete this recipe, you will need to activate the tf2rl-cookbook Python/conda virtual environment and run pip install numpy gym. If the following import statements run without issues, you are ready to get started:

import gym
import matplotlib.pyplot as plt
import numpy as np

Now, we can begin.

How to do it…

This recipe will contain two components that we will put together at the end. The first component is the GridworldV2 implementation, while the second component is the TD learning algorithm's implementation. Let's get started:

We will start by implementing GridworldV2, beginning with the definition of the GridworldV2Env class and the mappings between state indices and grid coordinates:

class GridworldV2Env(gym.Env):
    def __init__(self, step_cost=-0.2, max_ep_length=500,
    explore_start=False):
        self.index_to_coordinate_map = {
            "0": [0, 0],
            "1": [0, 1],
            "2": [0, 2],
            "3": [0, 3],
            "4": [1, 0],
            "5": [1, 1],
            "6": [1, 2],
            "7": [1, 3],
            "8": [2, 0],
            "9": [2, 1],
            "10": [2, 2],
            "11": [2, 3],
        }
        self.coordinate_to_index_map = {
            str(val): int(key) for key, val in self.index_to_coordinate_map.items()
        }

In this step, you will continue implementing the __init__ method by setting the size of the Gridworld, the locations of the goal, the wall, and the bomb, and the remaining environment attributes:

        self.map = np.zeros((3, 4))
        self.observation_space = gym.spaces.Discrete(12)  # one integer per grid cell
        self.distinct_states = [str(i) for i in \
                                 range(12)]
        self.goal_coordinate = [0, 3]
        self.bomb_coordinate = [1, 3]
        self.wall_coordinate = [1, 1]
        self.goal_state = self.coordinate_to_index_map[
                          str(self.goal_coordinate)]  # 3
        self.bomb_state = self.coordinate_to_index_map[
                          str(self.bomb_coordinate)]  # 7
        self.map[self.goal_coordinate[0]]\
                [self.goal_coordinate[1]] = 1
        self.map[self.bomb_coordinate[0]]\
                [self.bomb_coordinate[1]] = -1
        self.map[self.wall_coordinate[0]]\
                [self.wall_coordinate[1]] = 2
        self.exploring_starts = explore_start
        self.state = 8
        self.done = False
        self.max_ep_length = max_ep_length
        self.steps = 0
        self.step_cost = step_cost
        self.action_space = gym.spaces.Discrete(4)
        self.action_map = {"UP": 0, "RIGHT": 1, 
                           "DOWN": 2, "LEFT": 3}
        self.possible_actions = \
                           list(self.action_map.values())

Now, we can move on to the definition of the reset() method, which will be called at the start of every episode, including the first one:

    def reset(self):
        self.done = False
        self.steps = 0
        self.map = np.zeros((3, 4))
        self.map[self.goal_coordinate[0]]\
                [self.goal_coordinate[1]] = 1
        self.map[self.bomb_coordinate[0]]\
                 [self.bomb_coordinate[1]] = -1
        self.map[self.wall_coordinate[0]]\
                [self.wall_coordinate[1]] = 2
        if self.exploring_starts:
            self.state = np.random.choice([0, 1, 2, 4, 6,
                                           8, 9, 10, 11])
        else:
            self.state = 8
        return self.state
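
The hard-coded list passed to np.random.choice contains every state index except the goal (3), the wall (5), and the bomb (7). As a minimal sketch, the same list can be derived instead of typed out. The names excluded_states and valid_starts are illustrative and are not part of the recipe's class:

```python
import numpy as np

NUM_STATES = 12  # 3 rows x 4 columns

# Indices an episode should never start in: goal, wall, and bomb.
# (Assumed from the coordinates [0, 3], [1, 1], and [1, 3] in the recipe.)
excluded_states = {3, 5, 7}

valid_starts = [s for s in range(NUM_STATES) if s not in excluded_states]

# Sample a start state uniformly, as reset() does when exploring starts is on.
start_state = int(np.random.choice(valid_starts))
```

Deriving the list this way keeps the start states in sync if the goal, wall, or bomb is ever moved.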

Let's implement a get_next_state method so that we can conveniently obtain the next state:

    def get_next_state(self, current_position, action):
        next_state = self.index_to_coordinate_map[
                            str(current_position)].copy()
        if action == 0 and next_state[0] != 0 and \
        next_state != [2, 1]:
            # Move up
            next_state[0] -= 1
        elif action == 1 and next_state[1] != 3 and \
        next_state != [1, 0]:
            # Move right
            next_state[1] += 1
        elif action == 2 and next_state[0] != 2 and \
        next_state != [0, 1]:
            # Move down
            next_state[0] += 1
        elif action == 3 and next_state[1] != 0 and \
        next_state != [1, 2]:
            # Move left
            next_state[1] -= 1
        else:
            pass
        return self.coordinate_to_index_map[str(
                                             next_state)]
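
The conditions in get_next_state encode two rules: the agent cannot leave the grid, and it cannot enter the wall cell at [1, 1]. The following standalone sketch expresses the same rules with an explicit bounds check and wall check, which may be easier to verify by hand. The function next_index and the constants are hypothetical names, not part of the recipe's code:

```python
ROWS, COLS = 3, 4
WALL = (1, 1)
# Action encoding from the recipe: UP=0, RIGHT=1, DOWN=2, LEFT=3
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def next_index(state, action):
    """Return the state index reached from `state` by taking `action`."""
    row, col = divmod(state, COLS)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    off_grid = not (0 <= new_row < ROWS and 0 <= new_col < COLS)
    if off_grid or (new_row, new_col) == WALL:
        return state  # blocked moves leave the agent where it is
    return new_row * COLS + new_col


moved_up = next_index(8, 0)      # [2, 0] -> [1, 0]
blocked_up = next_index(9, 0)    # [2, 1] -> wall, so the agent stays
blocked_left = next_index(0, 3)  # [0, 0] -> off the grid, so the agent stays
```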

With that, we are ready to implement the main step method of the GridworldV2 environment:

    def step(self, action):
        assert action in self.possible_actions, \
        f"Invalid action:{action}"
        current_position = self.state
        next_state = self.get_next_state(
                               current_position, action)
        self.steps += 1
        if next_state == self.goal_state:
            reward = 1
            self.done = True
        elif next_state == self.bomb_state:
            reward = -1
            self.done = True
        else:
            reward = self.step_cost
        if self.steps == self.max_ep_length:
            self.done = True
        self.state = next_state
        return next_state, reward, self.done
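
To make the reward and termination rules in step easier to check in isolation, here is a minimal sketch that restates them as a pure function. The name reward_and_done is an assumption made for illustration, and the default values simply mirror the constructor defaults shown earlier:

```python
GOAL_STATE, BOMB_STATE = 3, 7  # indices of [0, 3] and [1, 3]


def reward_and_done(next_state, steps, step_cost=-0.2, max_ep_length=500):
    """Reward and termination flag for arriving in `next_state`."""
    if next_state == GOAL_STATE:
        reward, done = 1, True
    elif next_state == BOMB_STATE:
        reward, done = -1, True
    else:
        reward, done = step_cost, False
    # An episode also ends once the step budget is used up.
    if steps == max_ep_length:
        done = True
    return reward, done


goal_result = reward_and_done(3, steps=1)
timeout_result = reward_and_done(4, steps=500)
```

Every non-terminal move costs step_cost, so under this reward scheme shorter paths to the goal yield a higher return.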

Now, we can move on and implement the temporal difference learning algorithm. Let's begin by initializing the state values of the grid using a 2D NumPy array and then setting the values of the goal and bomb states:

def temporal_difference_learning(env, max_episodes):
    grid_state_values = np.zeros((len(
                               env.distinct_states), 1))
    grid_state_values[env.goal_state] = 1
    grid_state_values[env.bomb_state] = -1

Next, let's define the discount factor, gamma, the learning rate, alpha, and initialize done to False:

    # v: state-value function
    v = grid_state_values
    gamma = 0.99  # Discount factor
    alpha = 0.01  # learning rate
    done = False

We can now define the main outer loop so that it runs max_episodes times, resetting the state of the environment to its initial state at the start of every episode:
```
    for episode in range(max_episodes):
        state = env.reset()
        done = False  # reset the flag so that every episode actually runs
```

Now, it's time to implement the inner loop with the temporal difference learning update one-liner:

        while not done:
            action = env.action_space.sample()  # random policy
            next_state, reward, done = env.step(action)
            # State-value function updates using TD(0)
            v[state] += alpha * (reward + gamma * \
                                v[next_state] - v[state])
            state = next_state
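
To see what a single TD(0) update does, consider the agent stepping from state 2 into the goal (state 3) and receiving a reward of 1. The following worked sketch uses the same gamma, alpha, and initial values as the recipe. The variable names td_target and td_error are added here for readability and do not appear in the recipe's code:

```python
import numpy as np

gamma, alpha = 0.99, 0.01

v = np.zeros((12, 1))
v[3] = 1   # goal state value
v[7] = -1  # bomb state value

state, next_state, reward = 2, 3, 1

# TD target: the observed reward plus the discounted value of the next state.
td_target = reward + gamma * v[next_state]  # 1 + 0.99 * 1 = 1.99
# TD error: how far the current estimate is from that target.
td_error = td_target - v[state]             # 1.99 - 0 = 1.99
# Move the estimate a small step (alpha) toward the target.
v[state] += alpha * td_error                # v[2] is now roughly 0.0199
```

Because alpha is small, each visit nudges the estimate only slightly, which is why the recipe runs thousands of episodes.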

Once the learning has converged, we want to visualize the value of each state in the GridworldV2 environment. To do that, we can use the visualize_grid_state_values function from the value_function_utils module, which must be imported alongside the other imports (from value_function_utils import visualize_grid_state_values):
```
    visualize_grid_state_values(grid_state_values.reshape((3, 4)))
```

We are now ready to run the temporal_difference_learning function from our main function:

if __name__ == "__main__":
    max_episodes = 4000
    env = GridworldV2Env(step_cost=-0.1, 
                         max_ep_length=30)
    temporal_difference_learning(env, max_episodes)

The preceding code will take a few seconds to run temporal difference learning for max_episodes episodes. It will then produce a diagram similar to the following:

Figure 2.7 – Rendering of the GridworldV2 environment, with the grid cell coordinates and state values colored according to the scale shown on the right

How it works…

In our environment's implementation, goal_state is located at (0, 3) and bomb_state is located at (1, 3). You can confirm this from the coordinates, colors, and values of the grid cells in the following rendering:

Figure 2.8 – Rendering of the GridworldV2 environment with initial state values

The state is linearized: a single integer identifies each of the 12 distinct states in the GridworldV2 environment. The following diagram shows a linearized rendering of the grid states to give you a better understanding of the state encoding:

Figure 2.9 – Linearized representation of the states
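
The linearization is row-major: the index equals row * 4 + column. As a small sketch, the two lookup dictionaries in the environment could be replaced by arithmetic. The helper names below are hypothetical and are not defined in the recipe:

```python
GRID_COLS = 4  # the grid has 3 rows and 4 columns


def index_to_coordinate(state):
    """Convert a linear state index into a [row, col] coordinate."""
    row, col = divmod(state, GRID_COLS)
    return [row, col]


def coordinate_to_index(coordinate):
    """Convert a [row, col] coordinate into a linear state index."""
    row, col = coordinate
    return row * GRID_COLS + col


goal_index = coordinate_to_index([0, 3])  # 3
bomb_coordinate = index_to_coordinate(7)  # [1, 3]
```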

Now that we have seen how to implement temporal difference learning, let's move on to building Monte Carlo algorithms.
