TensorFlow 2 Reinforcement Learning Cookbook

By: Palanisamy P

Overview of this book

With deep reinforcement learning, you can build intelligent agents, products, and services that can go beyond computer vision or perception to perform actions. TensorFlow 2.x is the latest major release of the most popular deep learning framework used to develop and train deep neural networks (DNNs). This book contains easy-to-follow recipes for leveraging TensorFlow 2.x to develop artificial intelligence applications. Starting with an introduction to the fundamentals of deep reinforcement learning and TensorFlow 2.x, the book covers OpenAI Gym, model-based RL, model-free RL, and how to develop basic agents. You'll discover how to implement advanced deep reinforcement learning algorithms such as actor-critic, deep deterministic policy gradients, deep Q-networks, proximal policy optimization, and deep recurrent Q-networks for training your RL agents. As you advance, you'll explore the applications of reinforcement learning by building cryptocurrency trading agents, stock/share trading agents, and intelligent agents for automating task completion. Finally, you'll find out how to deploy deep reinforcement learning agents to the cloud and build cross-platform apps using TensorFlow 2.x. By the end of this TensorFlow book, you'll have gained a solid understanding of deep reinforcement learning algorithms and their implementations from scratch.

Building an environment and reward mechanism for training RL agents

This recipe will walk you through the steps to build a Gridworld learning environment to train RL agents. Gridworld is a simple environment where the world is represented as a grid. Each location on the grid can be referred to as a cell. The goal of an agent in this environment is to find its way to the goal state in a grid like the one shown here:

Figure 1.1 – A screenshot of the Gridworld environment

The agent's location is represented by the blue cell in the grid, while the goal and the mine/bomb/obstacle are represented by the green and red cells, respectively. The agent (blue cell) needs to find its way through the grid to reach the goal (green cell) without running over the mine/bomb (red cell).

Getting ready

To complete this recipe, you will first need to activate the tf2rl-cookbook Python/Conda virtual environment and install the dependencies with pip install numpy gym. If the following import statements run without issues, you are ready to get started!

import copy
import sys
import gym
import numpy as np

Now we can begin.

How to do it…

To train RL agents, we need a learning environment, which plays a role akin to that of the datasets used in supervised learning. The learning environment is a simulator that provides observations to the RL agent, supports a set of actions that the agent can execute, and returns a new observation (and a reward) in response to each action the agent takes.

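As a rough mental model, every Gym-compatible environment boils down to the interface sketched below. This is only a sketch (the class name SketchEnv is made up here, and the method bodies are left empty); the steps that follow implement each of these pieces for Gridworld:

    class SketchEnv:
        def __init__(self):
            self.observation_space = None  # what the agent can observe
            self.action_space = None       # what the agent is allowed to do

        def reset(self):
            """Start a new episode and return the initial observation."""

        def step(self, action):
            """Apply action; return (next_observation, reward, done, info)."""

        def render(self, mode="human"):
            """Display the current state of the environment."""
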
Perform the following steps to implement a Gridworld learning environment, a simple 2D map in which colored cells mark the locations of the agent, the goal, the mine/bomb/obstacle, the walls, and the empty spaces on a grid:

  1. We'll start by defining the mapping between the different cell states and their color codes, to be used in the Gridworld environment:
    EMPTY = BLACK = 0
    WALL = GRAY = 1
    AGENT = BLUE = 2
    MINE = RED = 3
    GOAL = GREEN = 4
    SUCCESS = PINK = 5
  2. Next, generate a color map using RGB intensity values:
    COLOR_MAP = {
        BLACK: [0.0, 0.0, 0.0],
        GRAY: [0.5, 0.5, 0.5],
        BLUE: [0.0, 0.0, 1.0],
        RED: [1.0, 0.0, 0.0],
        GREEN: [0.0, 1.0, 0.0],
        PINK: [1.0, 0.0, 1.0],
    }
  3. Let's now define the action mapping:
    NOOP = 0
    DOWN = 1
    UP = 2
    LEFT = 3
    RIGHT = 4
  4. Let's then create a GridworldEnv class with an __init__ function to define necessary class variables, including the observation and action space:
    class GridworldEnv:
        def __init__(self):

    We will implement __init__() in the following steps.

  5. In this step, let's define the layout of the Gridworld environment using the grid cell state mapping:
            self.grid_layout = """
            1 1 1 1 1 1 1 1
            1 2 0 0 0 0 0 1
            1 0 1 1 1 0 0 1
            1 0 1 0 1 0 0 1
            1 0 1 4 1 0 0 1
            1 0 3 0 0 0 0 1
            1 0 0 0 0 0 0 1
            1 1 1 1 1 1 1 1
            """

    In the preceding layout, 0 corresponds to the empty cells, 1 corresponds to walls, 2 corresponds to the agent's starting location, 3 corresponds to the location of the mine/bomb/obstacle, and 4 corresponds to the goal location based on the mapping we defined in step 1.

  6. Now, we are ready to define the observation space for the Gridworld RL environment:
            self.initial_grid_state = np.fromstring(
                self.grid_layout, dtype=int, sep=" "
            )
            self.initial_grid_state = self.initial_grid_state.reshape(8, 8)
            self.grid_state = copy.deepcopy(self.initial_grid_state)
            self.observation_space = gym.spaces.Box(
                low=0, high=6, shape=self.grid_state.shape
            )
            self.img_shape = [256, 256, 3]
            self.metadata = {"render.modes": ["human"]}
            # Image viewer handle used by render(); created lazily
            self.viewer = None
  7. Let's define the action space and the mapping between the actions and the movement of the agent in the grid:
            self.action_space = gym.spaces.Discrete(5)
            self.actions = [NOOP, UP, DOWN, LEFT, RIGHT]
            self.action_pos_dict = {
                NOOP: [0, 0],
                UP: [-1, 0],
                DOWN: [1, 0],
                LEFT: [0, -1],
                RIGHT: [0, 1],
            }
  8. Let's now wrap up the __init__ function by initializing the agent's start and goal states using the get_state() method (which we will implement in the next step):
            self.agent_start_state, self.agent_goal_state = self.get_state()
  9. Now we need to implement the get_state() method, which returns the start and goal states for the Gridworld environment (a short, standalone np.where example follows these steps, in case the indexing here looks unfamiliar):
        def get_state(self):
            start_state = np.where(self.grid_state == AGENT)
            goal_state = np.where(self.grid_state == GOAL)
            start_or_goal_not_found = not (start_state[0] and goal_state[0])
            if start_or_goal_not_found:
                sys.exit(
                    "Start and/or Goal state not present in the Gridworld. "
                    "Check the Grid layout"
                )
            start_state = (start_state[0][0], start_state[1][0])
            goal_state = (goal_state[0][0], goal_state[1][0])
            return start_state, goal_state
  10. In this step, we will be implementing the step(action) method to execute the action and retrieve the next state/observation, the associated reward, and whether the episode ended:
        def step(self, action):
            """Return next observation, reward, done, info"""
            action = int(action)
            info = {"success": True}
            done = False
            reward = 0.0
            next_obs = (
                self.agent_state[0] + self.action_pos_dict[action][0],
                self.agent_state[1] + self.action_pos_dict[action][1],
            )
  11. Next, let's specify the rewards and finally, return grid_state, reward, done, and info:
            # Determine the reward and the next grid state
            if action == NOOP:
                return self.grid_state, reward, False, info
            # Reject moves that would take the agent off the grid
            next_state_invalid = (
                next_obs[0] < 0 or next_obs[0] >= self.grid_state.shape[0]
            ) or (
                next_obs[1] < 0 or next_obs[1] >= self.grid_state.shape[1]
            )
            if next_state_invalid:
                info["success"] = False
                return self.grid_state, reward, False, info
            next_state = self.grid_state[next_obs[0], next_obs[1]]
            if next_state == EMPTY:
                self.grid_state[next_obs[0], next_obs[1]] = AGENT
            elif next_state == WALL:
                info["success"] = False
                reward = -0.1
                return self.grid_state, reward, False, info
            elif next_state == GOAL:
                done = True
                reward = 1
            elif next_state == MINE:
                done = True
                reward = -1
            # Move the agent: clear its previous cell and update its position
            self.grid_state[self.agent_state[0], self.agent_state[1]] = EMPTY
            self.agent_state = copy.deepcopy(next_obs)
            return self.grid_state, reward, done, info
  12. Up next is the reset() method, which resets the Gridworld environment when an episode completes (or if a request to reset the environment is made):
        def reset(self):
            self.grid_state = copy.deepcopy(self.initial_grid_state)
            self.agent_state, self.agent_goal_state = self.get_state()
            return self.grid_state
  13. To visualize the state of the Gridworld environment in a human-friendly manner, let's implement a render function that converts the grid state (laid out in step 5) to an image and displays it, along with a small close() method so that the viewer can be shut down cleanly. With that, the Gridworld environment implementation will be complete!
        def gridarray_to_image(self, img_shape=None):
            if img_shape is None:
                img_shape = self.img_shape
            observation = np.zeros(img_shape)  # Blank RGB canvas
            scale_x = int(observation.shape[0] / self.grid_state.shape[0])
            scale_y = int(observation.shape[1] / self.grid_state.shape[1])
            for i in range(self.grid_state.shape[0]):
                for j in range(self.grid_state.shape[1]):
                    for k in range(3):  # 3-channel RGB image
                        pixel_value = COLOR_MAP[self.grid_state[i, j]][k]
                        observation[
                            i * scale_x : (i + 1) * scale_x,
                            j * scale_y : (j + 1) * scale_y,
                            k,
                        ] = pixel_value
            return (255 * observation).astype(np.uint8)

        def render(self, mode="human", close=False):
            if close:
                if self.viewer is not None:
                    self.viewer.close()
                    self.viewer = None
                return
            img = self.gridarray_to_image()
            if mode == "rgb_array":
                return img
            elif mode == "human":
                from gym.envs.classic_control import rendering

                if self.viewer is None:
                    self.viewer = rendering.SimpleImageViewer()
                self.viewer.imshow(img)

        def close(self):
            # env.close() in the test code below relies on this method
            self.render(close=True)
  14. To test whether the environment is working as expected, let's add a __main__ block that gets executed if the environment script is run directly:
    if __name__ == "__main__":
        env = GridworldEnv()
        obs = env.reset()
        # Sample a random action from the action space
        action = env.action_space.sample()
        next_obs, reward, done, info = env.step(action)
        print(f"reward:{reward} done:{done} info:{info}")
        env.render()
        env.close()
  15. All set! The Gridworld environment is ready and we can quickly test it by running the script (python envs/gridworld.py). An output such as the following will be displayed:
    reward:0.0 done:False info:{'success': True}

    The following rendering of the Gridworld environment will also be displayed:

Figure 1.2 – The Gridworld

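If the double indexing used in get_state() (step 9) looks cryptic, the following standalone snippet (a sketch on a toy 3 x 3 grid, not the recipe's layout) shows what np.where returns and why we pick the [0][0] and [1][0] elements:

    import numpy as np

    toy_grid = np.array([
        [1, 1, 1],
        [1, 2, 1],  # the value 2 (AGENT) sits at row 1, column 1
        [1, 1, 1],
    ])
    rows, cols = np.where(toy_grid == 2)
    print(rows, cols)  # [1] [1] -- arrays of matching row and column indices
    agent_position = (rows[0], cols[0])
    print(agent_position)  # (1, 1)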

Let's now see how it works!

How it works…

The grid_layout defined in step 5 of the How to do it… section represents the state of the learning environment. The Gridworld environment defines the observation space, the action space, and the reward mechanism needed to implement a Markov Decision Process (MDP). We sample a valid action from the environment's action space and step the environment with the chosen action, which produces the new observation, the reward, and a done Boolean (indicating whether the episode has finished) as the environment's response. The env.render() method converts the environment's internal grid representation into an image and displays it for visual inspection.
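
To see this interaction loop end to end, you can extend the test code from step 14 into a full episode driven by a random agent. The snippet below is a minimal sketch; it assumes the environment class is importable from envs/gridworld.py (the script path used in step 15) and simply keeps sampling random actions until the episode ends or a step cap is reached:

    from envs.gridworld import GridworldEnv  # assumes the file created in this recipe

    env = GridworldEnv()
    obs = env.reset()
    done = False
    total_reward = 0.0
    for step_num in range(1, 501):  # cap the episode length for this quick test
        action = env.action_space.sample()  # a random (but valid) action
        obs, reward, done, info = env.step(action)
        total_reward += reward
        env.render()
        if done:
            break
    print(f"Episode finished after {step_num} steps. total_reward:{total_reward}")
    env.close()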