TensorFlow 2 Reinforcement Learning Cookbook

By: Palanisamy

Overview of this book

With deep reinforcement learning, you can build intelligent agents, products, and services that go beyond computer vision or perception to perform actions. TensorFlow 2.x is the latest major release of the most popular deep learning framework used to develop and train deep neural networks (DNNs). This book contains easy-to-follow recipes for leveraging TensorFlow 2.x to develop artificial intelligence applications. Starting with an introduction to the fundamentals of deep reinforcement learning and TensorFlow 2.x, the book covers OpenAI Gym, model-based RL, model-free RL, and how to develop basic agents. You'll discover how to implement advanced deep reinforcement learning algorithms such as actor-critic, deep deterministic policy gradients, deep Q-networks, proximal policy optimization, and deep recurrent Q-networks for training your RL agents. As you advance, you'll explore the applications of reinforcement learning by building cryptocurrency trading agents, stock/share trading agents, and intelligent agents for automating task completion. Finally, you'll find out how to deploy deep reinforcement learning agents to the cloud and build cross-platform apps using TensorFlow 2.x. By the end of this TensorFlow book, you'll have gained a solid understanding of deep reinforcement learning algorithms and their implementations from scratch.
Table of Contents (11 chapters)

Implementing temporal difference learning

This recipe will walk you through how to implement the temporal difference (TD) learning algorithm. TD algorithms allow us to incrementally learn from incomplete episodes of agent experiences, which means they can be used for problems that require online learning capabilities. TD algorithms are useful in model-free RL settings as they do not depend on a model of the MDP transitions or rewards. To visually understand the learning progression of the TD algorithm, this recipe will also show you how to implement the GridworldV2 learning environment, which looks as follows when rendered:

Figure 2.6 – The GridworldV2 learning environment 2D rendering with state values and grid cell coordinates
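
To make the incremental nature of TD learning concrete, the following is a minimal sketch of a single TD(0) update applied to one observed transition, without waiting for the episode to end. The function name td0_update, the dictionary-based value table, and the example numbers are assumptions chosen for illustration; they are not part of the recipe's code.

```python
def td0_update(v, state, reward, next_state, alpha=0.01, gamma=0.99):
    """Apply one TD(0) update to the value table `v` in place.

    `v` maps a state index to its estimated value. All names here are
    illustrative and are not taken from the recipe.
    """
    # Bootstrapped target: immediate reward plus discounted next-state value
    td_target = reward + gamma * v[next_state]
    td_error = td_target - v[state]
    # Move the current estimate a small step toward the target
    v[state] += alpha * td_error
    return td_error


# Hypothetical two-state value table (made-up numbers)
values = {0: 0.0, 1: 0.5}

# One observed transition: state 0 -> state 1 with a reward of -0.1
error = td0_update(values, state=0, reward=-0.1, next_state=1)

# The TD error is about 0.395, so values[0] moves to about 0.00395
print(round(error, 5), round(values[0], 5))
```

Only the value of the state that was just left is changed, which is why the update can be applied after every step of an unfinished episode.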

Getting ready

To complete this recipe, you will need to activate the tf2rl-cookbook Python/conda virtual environment and run pip install numpy gym. If the following import statements run without issues, you are ready to get started:

import gym
import matplotlib.pyplot as plt
import numpy as np

Now, we can begin.

How to do it…

This recipe will contain two components that we will put together at the end. The first component is the GridworldV2 implementation, while the second component is the TD learning algorithm's implementation. Let's get started:

  1. We will start by implementing GridworldV2, beginning with the definition of the GridworldV2Env class:
    class GridworldV2Env(gym.Env):
        def __init__(self, step_cost=-0.2, max_ep_length=500,
        explore_start=False):
            self.index_to_coordinate_map = {
                "0": [0, 0],
                "1": [0, 1],
                "2": [0, 2],
                "3": [0, 3],
                "4": [1, 0],
                "5": [1, 1],
                "6": [1, 2],
                "7": [1, 3],
                "8": [2, 0],
                "9": [2, 1],
                "10": [2, 2],
                "11": [2, 3],
            }
            self.coordinate_to_index_map = {
                str(val): int(key) for key, val in self.index_to_coordinate_map.items()
            }
  2. In this step, you will continue implementing the __init__ method by setting the size of the Gridworld, the goal location, the wall location, and the location of the bomb, among other things:
            self.map = np.zeros((3, 4))
            # One discrete observation per grid cell (state indices 0 to 11)
            self.observation_space = gym.spaces.Discrete(12)
            self.distinct_states = [str(i) for i in \
                                     range(12)]
            self.goal_coordinate = [0, 3]
            self.bomb_coordinate = [1, 3]
            self.wall_coordinate = [1, 1]
            self.goal_state = self.coordinate_to_index_map[
                              str(self.goal_coordinate)]  # 3
            self.bomb_state = self.coordinate_to_index_map[
                              str(self.bomb_coordinate)]  # 7
            self.map[self.goal_coordinate[0]]\
                    [self.goal_coordinate[1]] = 1
            self.map[self.bomb_coordinate[0]]\
                    [self.bomb_coordinate[1]] = -1
            self.map[self.wall_coordinate[0]]\
                    [self.wall_coordinate[1]] = 2
            self.exploring_starts = explore_start
            self.state = 8
            self.done = False
            self.max_ep_length = max_ep_length
            self.steps = 0
            self.step_cost = step_cost
            self.action_space = gym.spaces.Discrete(4)
            self.action_map = {"UP": 0, "RIGHT": 1, 
                               "DOWN": 2, "LEFT": 3}
            self.possible_actions = \
                               list(self.action_map.values())
  3. Now, we can move on to the definition of the reset() method, which will be called at the start of every episode, including the first one:
        def reset(self):
            self.done = False
            self.steps = 0
            self.map = np.zeros((3, 4))
            self.map[self.goal_coordinate[0]]\
                    [self.goal_coordinate[1]] = 1
            self.map[self.bomb_coordinate[0]]\
                     [self.bomb_coordinate[1]] = -1
            self.map[self.wall_coordinate[0]]\
                    [self.wall_coordinate[1]] = 2
            if self.exploring_starts:
                self.state = np.random.choice([0, 1, 2, 4, 6,
                                               8, 9, 10, 11])
            else:
                self.state = 8
            return self.state
  4. Let's implement a get_next_state method so that we can conveniently obtain the next state:
        def get_next_state(self, current_position, action):
            next_state = self.index_to_coordinate_map[
                                str(current_position)].copy()
            if action == 0 and next_state[0] != 0 and \
            next_state != [2, 1]:
                # Move up
                next_state[0] -= 1
            elif action == 1 and next_state[1] != 3 and \
            next_state != [1, 0]:
                # Move right
                next_state[1] += 1
            elif action == 2 and next_state[0] != 2 and \
            next_state != [0, 1]:
                # Move down
                next_state[0] += 1
            elif action == 3 and next_state[1] != 0 and \
            next_state != [1, 2]:
                # Move left
                next_state[1] -= 1
            else:
                pass
            return self.coordinate_to_index_map[str(
                                                 next_state)]
  5. With that, we are ready to implement the main step method of the GridworldV2 environment:
        def step(self, action):
            assert action in self.possible_actions, \
            f"Invalid action:{action}"
            current_position = self.state
            next_state = self.get_next_state(
                                   current_position, action)
            self.steps += 1
            if next_state == self.goal_state:
                reward = 1
                self.done = True
            elif next_state == self.bomb_state:
                reward = -1
                self.done = True
            else:
                reward = self.step_cost
            if self.steps == self.max_ep_length:
                self.done = True
            self.state = next_state
            return next_state, reward, self.done
  6. Now, we can move on and implement the temporal difference learning algorithm. Let's begin by initializing the state values of the grid using a 2D numpy array and then setting the values of the goal state and the bomb state:
    def temporal_difference_learning(env, max_episodes):
        grid_state_values = np.zeros((len(
                                   env.distinct_states), 1))
        grid_state_values[env.goal_state] = 1
        grid_state_values[env.bomb_state] = -1
  7. Next, let's define the discount factor, gamma, the learning rate, alpha, and initialize done to False:
        # v: state-value function
        v = grid_state_values
        gamma = 0.99  # Discount factor
        alpha = 0.01  # learning rate
        done = False
  8. We can now define the main outer loop so that it runs max_episodes times, resetting the environment to its initial state and clearing the done flag at the start of every episode:
        for episode in range(max_episodes):
            state = env.reset()
            done = False  # otherwise only the first episode runs
  9. Now, it's time to implement the inner loop with the temporal difference learning update one-liner:
            while not done:
                # random policy
                action = env.action_space.sample()
                next_state, reward, done = env.step(action)
                # State-value function updates using TD(0)
                v[state] += alpha * (reward + gamma * \
                                    v[next_state] - v[state])
                state = next_state
  10. Once the learning loop has finished, we want to be able to visualize the state values for each state in the GridworldV2 environment. To do that, we can make use of the visualize_grid_state_values function from value_function_utils (imported at the top of the script with from value_function_utils import visualize_grid_state_values), calling it after the outer loop:
        visualize_grid_state_values(grid_state_values.reshape((3, 4)))
  11. We are now ready to run the temporal_difference_learning function from our main function:
    if __name__ == "__main__":
        max_episodes = 4000
        env = GridworldV2Env(step_cost=-0.1, 
                             max_ep_length=30)
        temporal_difference_learning(env, max_episodes)
  12. The preceding code will take a few seconds to run temporal difference learning for max_episodes episodes. It will then produce a diagram similar to the following:
Figure 2.7 – Rendering of the GridworldV2 environment, with the grid cell coordinates and state values colored according to the scale shown on the right
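
The recipe relies on visualize_grid_state_values from value_function_utils, whose source is not shown here. As a rough stand-in for quick inspection in a terminal, the following sketch formats the 12 linearized state values as a 3 x 4 text grid. The helper name format_grid_state_values and the sample values are hypothetical and are not part of the recipe's utilities.

```python
def format_grid_state_values(values, n_rows=3, n_cols=4):
    """Return a text table of linearized state values, one grid row per line."""
    assert len(values) == n_rows * n_cols, "unexpected number of states"
    lines = []
    for row in range(n_rows):
        # Row-major slice: states 0-3, then 4-7, then 8-11
        row_values = values[row * n_cols:(row + 1) * n_cols]
        lines.append(" ".join(f"{float(v):6.2f}" for v in row_values))
    return "\n".join(lines)


# Made-up values for illustration only: goal (index 3) at 1, bomb (index 7) at -1
sample_values = [0.0, 0.0, 0.0, 1.0,
                 0.0, 0.0, 0.0, -1.0,
                 0.0, 0.0, 0.0, 0.0]
print(format_grid_state_values(sample_values))
```

With the recipe's (12, 1) numpy array, passing grid_state_values.flatten() should work the same way, since the helper only needs a length, slicing, and values convertible to float.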

How it works…

Based on our environment's implementation, you may have noticed that goal_state is located at (0, 3) and that bomb_state is located at (1, 3). These locations match the coordinates, colors, and values of the grid cells in the following rendering:

Figure 2.8 – Rendering of the GridWorldV2 environment with initial state values
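
The environment also places a wall at (1, 1) (wall_coordinate in __init__), and the hard-coded coordinate checks in get_next_state exist to keep the agent out of that cell and inside the grid. A minimal standalone sketch of the same movement rule, written in terms of the wall coordinate rather than its neighboring cells, might look as follows. The names next_cell, MOVES, and WALL are illustrative assumptions and are not part of the recipe.

```python
N_ROWS, N_COLS = 3, 4
WALL = (1, 1)
# Same action encoding as the recipe: UP=0, RIGHT=1, DOWN=2, LEFT=3
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def next_cell(cell, action):
    """Return the cell reached from `cell`; stay put if the move is blocked."""
    row, col = cell
    d_row, d_col = MOVES[action]
    target = (row + d_row, col + d_col)
    in_bounds = 0 <= target[0] < N_ROWS and 0 <= target[1] < N_COLS
    if not in_bounds or target == WALL:
        return cell
    return target


print(next_cell((2, 0), 0))  # moves up to (1, 0)
print(next_cell((2, 1), 0))  # blocked by the wall, stays at (2, 1)
print(next_cell((0, 3), 1))  # blocked by the grid edge, stays at (0, 3)
```

Expressing the rule this way keeps the wall location in a single place, at the cost of departing from the recipe's string-keyed lookup tables.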

The state is linearized and is represented using a single integer that identifies one of the 12 distinct states in the GridworldV2 environment. The following diagram shows a linearized rendering of the grid states to give you a better understanding of the state encoding:

Figure 2.9 – Linearized representation of the states
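
The linearized encoding follows row-major order, so the lookup tables built in __init__ can also be expressed arithmetically. The sketch below illustrates that relationship; the function names index_to_coordinate and coordinate_to_index are hypothetical and are not part of the recipe, which uses dictionaries instead.

```python
N_COLS = 4  # the recipe's grid has 3 rows and 4 columns


def index_to_coordinate(index):
    """Map a linearized state index to [row, col] in row-major order."""
    row, col = divmod(index, N_COLS)
    return [row, col]


def coordinate_to_index(coordinate):
    """Map [row, col] back to the linearized state index."""
    row, col = coordinate
    return row * N_COLS + col


print(index_to_coordinate(7))       # [1, 3], the bomb cell
print(coordinate_to_index([0, 3]))  # 3, the goal state
print(coordinate_to_index([2, 0]))  # 8, the default start state
```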

Now that we have seen how to implement temporal difference learning, let's move on to building Monte Carlo algorithms.
