PyTorch 1.x Reinforcement Learning Cookbook

By Yuxi (Hayden) Liu

Overview of this book

Reinforcement learning (RL) is a branch of machine learning that has gained popularity in recent times. It allows you to train AI models that learn from their own actions and optimize their behavior. PyTorch has also emerged as the preferred tool for training RL models because of its efficiency and ease of use. With this book, you'll explore the important RL concepts and the implementation of algorithms in PyTorch 1.x. The recipes in the book, along with real-world examples, will help you master various RL techniques, such as dynamic programming, Monte Carlo simulations, temporal difference learning, and Q-learning. You'll also gain insights into industry-specific applications of these techniques. Later chapters will guide you through solving problems such as the multi-armed bandit problem and the CartPole problem using multi-armed bandit algorithms and function approximation. You'll also learn how to use Deep Q-Networks to play Atari games, along with how to effectively implement policy gradients. Finally, you'll discover how RL techniques are applied to Blackjack, Gridworld environments, internet advertising, and the Flappy Bird game. By the end of this book, you'll have developed the skills you need to implement popular RL algorithms and use RL techniques to solve real-world problems.

What this book covers

Chapter 1, Getting Started with Reinforcement Learning and PyTorch, is the starting point for readers beginning this book's step-by-step guide to reinforcement learning with PyTorch. We will set up the working environment and OpenAI Gym, and get familiar with reinforcement learning environments using the Atari and CartPole playgrounds. The chapter will also cover the implementation of several basic reinforcement learning algorithms, including random search, hill-climbing, and policy gradient. At the end, readers will have a chance to review the essentials of PyTorch and get ready for the upcoming learning examples and projects.
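
To give a flavor of the hands-on code this chapter builds toward, here is a minimal sketch (not taken from the book, and assuming the classic pre-0.26 Gym API) that runs one CartPole episode with random actions:

    import gym

    env = gym.make('CartPole-v0')
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()              # pick a random action
        state, reward, done, info = env.step(action)    # classic 4-tuple step API
        total_reward += reward
    print('Episode reward:', total_reward)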

Chapter 2, Markov Decision Process and Dynamic Programming, starts with the creation of a Markov chain and a Markov Decision Process (MDP), which is at the core of most reinforcement learning algorithms. It then moves on to two approaches for solving an MDP: value iteration and policy iteration. We will get more familiar with MDPs and the Bellman equation by practicing policy evaluation. We will also demonstrate how to solve the interesting coin-flipping gamble problem step by step. At the end, we will learn how to perform dynamic programming to scale up learning.
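
To illustrate the value iteration idea in advance, the following is a minimal sketch of the Bellman optimality backup on a toy two-state MDP (the transition and reward numbers are invented for illustration; this is not the book's code):

    import torch

    # Toy MDP: 2 states, 2 actions
    # T[s, a, s'] is the transition probability, R[s, a] the expected reward
    T = torch.tensor([[[0.8, 0.2], [0.1, 0.9]],
                      [[0.7, 0.3], [0.05, 0.95]]])
    R = torch.tensor([[1.0, 0.0],
                      [0.5, 2.0]])
    gamma, threshold = 0.99, 1e-4

    V = torch.zeros(2)
    while True:
        # Bellman optimality backup: V(s) = max_a [R(s, a) + gamma * sum_s' T(s, a, s') V(s')]
        V_new = (R + gamma * torch.einsum('sap,p->sa', T, V)).max(dim=1).values
        if (V_new - V).abs().max() < threshold:
            break
        V = V_new
    print(V_new)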

Chapter 3, Monte Carlo Methods for Making Numerical Estimations, is focused on Monte Carlo methods. We will start by estimating the value of pi with Monte Carlo. Moving on, we will learn how to use the Monte Carlo method to predict state values and state-action values. We will demonstrate training an agent to win at Blackjack using Monte Carlo. Also, we will explore on-policy, first-visit Monte Carlo control and off-policy Monte Carlo control by developing various algorithms. Monte Carlo Control with an epsilon-greedy policy and weighted importance sampling will also be covered.
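
As a preview of the first recipe, estimating pi with Monte Carlo sampling can be sketched in a few lines (a simplified illustration, not the book's exact code):

    import torch

    # Sample points uniformly in the unit square and count how many land
    # inside the quarter circle of radius 1; that fraction approximates pi / 4
    n = 1_000_000
    points = torch.rand(n, 2)
    inside = (points.pow(2).sum(dim=1) <= 1.0).float().mean()
    print((4 * inside).item())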

Chapter 4, Temporal Difference and Q-Learning, starts by setting up the CliffWalking and Windy Gridworld environment playgrounds, which will be used in temporal difference learning and Q-learning. Through our step-by-step guide, readers will explore temporal difference for prediction, and will gain practical experience with Q-learning for off-policy control and SARSA for on-policy control. We will also work on an interesting project, the taxi problem, and demonstrate how to solve it using the Q-learning and SARSA algorithms. Finally, we will cover the Double Q-learning algorithm as a bonus section.
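
As a preview of Q-learning applied to the taxi problem, a bare-bones tabular version might look like the following (hyperparameters and the classic pre-0.26 Gym API are assumptions, and this is not the book's exact code):

    import gym
    import torch

    env = gym.make('Taxi-v3')
    n_state, n_action = env.observation_space.n, env.action_space.n
    Q = torch.zeros(n_state, n_action)
    alpha, gamma, epsilon = 0.4, 0.99, 0.1

    for episode in range(1000):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if torch.rand(1).item() < epsilon:
                action = env.action_space.sample()
            else:
                action = torch.argmax(Q[state]).item()
            next_state, reward, done, _ = env.step(action)
            # off-policy TD target: bootstrap from the best next action
            Q[state][action] += alpha * (reward + gamma * Q[next_state].max() - Q[state][action])
            state = next_state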

Chapter 5, Solving Multi-Armed Bandit Problems, covers the multi-armed bandit algorithm, which is probably one of the most popular algorithms in reinforcement learning. The chapter starts with the creation of a multi-armed bandit problem. We will see how to solve it using four strategies: the epsilon-greedy policy, softmax exploration, the upper confidence bound algorithm, and Thompson sampling. We will also work on a billion-dollar problem, online advertising, and demonstrate how to solve it using the multi-armed bandit algorithm. Finally, we will develop a more complex algorithm, the contextual bandit algorithm, and use it to optimize display advertising.
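
For instance, the epsilon-greedy strategy for a Bernoulli multi-armed bandit can be sketched as follows (the arm payout probabilities are invented for illustration; this is not the book's code):

    import torch

    true_probs = torch.tensor([0.3, 0.5, 0.7])   # hidden payout rate of each arm
    n_arms, epsilon, n_steps = 3, 0.1, 10000
    counts = torch.zeros(n_arms)
    values = torch.zeros(n_arms)                 # running estimate of each arm's mean reward

    for _ in range(n_steps):
        if torch.rand(1).item() < epsilon:
            arm = torch.randint(n_arms, (1,)).item()    # explore: random arm
        else:
            arm = torch.argmax(values).item()           # exploit: best arm so far
        reward = float(torch.rand(1).item() < true_probs[arm].item())
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

    print(values)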

Chapter 6, Scaling Up Learning with Function Approximation, is focused on function approximation and starts with setting up the Mountain Car environment playground. Through our step-by-step guide, we will cover the motivation for function approximation over table lookup, and gain experience in incorporating function approximation into existing algorithms such as Q-learning and SARSA. We will also cover an advanced technique: batching using experience replay. Finally, we will cover how to solve the CartPole problem using what we have learned throughout the chapter.
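
To make the contrast with a lookup table concrete, a Q-value estimator backed by a small neural network might look like this sketch (layer sizes and names are arbitrary choices, not the book's):

    import torch
    import torch.nn as nn

    class QEstimator(nn.Module):
        """Maps a continuous state to one Q-value per action."""
        def __init__(self, n_state, n_action, n_hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_state, n_hidden),
                nn.ReLU(),
                nn.Linear(n_hidden, n_action),
            )

        def forward(self, state):
            return self.net(state)

    # Mountain Car has a 2-dimensional state (position, velocity) and 3 actions
    estimator = QEstimator(n_state=2, n_action=3)
    print(estimator(torch.tensor([-0.5, 0.0])))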

Chapter 7, Deep Q-Networks in Action, covers Deep Q-learning, or the Deep Q-Network (DQN), which is considered one of the most modern reinforcement learning techniques. We will develop a DQN model step by step and understand the importance of experience replay and a target network in making Deep Q-learning work in practice. To help readers solve Atari games, we will demonstrate how to incorporate convolutional neural networks into DQNs. We will also cover two DQN variants, Double DQNs and Dueling DQNs. We will cover how to fine-tune a Q-learning algorithm using Double DQNs as an example.
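
As a taste of one of those two ingredients, an experience replay buffer can be sketched as follows (class and method names are placeholders, not the book's code); the other ingredient, a target network, is simply a periodically synchronized copy of the online network used to compute the bootstrapped targets:

    import random
    from collections import deque

    import torch

    class ReplayBuffer:
        """Fixed-size memory of past transitions, sampled uniformly for training."""
        def __init__(self, capacity=10000):
            self.memory = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            batch = random.sample(self.memory, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return (torch.stack(states), torch.tensor(actions),
                    torch.tensor(rewards, dtype=torch.float32),
                    torch.stack(next_states),
                    torch.tensor(dones, dtype=torch.float32))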

Chapter 8, Implementing Policy Gradients and Policy Optimization, focuses on policy gradients and policy optimization and starts by implementing the REINFORCE algorithm. We will then develop the REINFORCE algorithm with a baseline for CliffWalking. We will also implement the actor-critic algorithm and apply it to solve the CliffWalking problem. To scale up the deterministic policy gradient algorithm, we apply tricks from DQN and develop the Deep Deterministic Policy Gradient (DDPG) algorithm. As a bit of fun, we train an agent based on the cross-entropy method to play the CartPole game. Finally, we will talk about how to scale up policy gradient methods using the asynchronous actor-critic method and neural networks.
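
The heart of REINFORCE is weighting each action's log-probability by the return that followed it; a minimal sketch of one policy update (network size, optimizer, and the normalization baseline are assumptions, not the book's code) is shown below:

    import torch
    import torch.nn as nn

    # A tiny softmax policy for a 4-dimensional state and 2 actions (e.g., CartPole)
    policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

    def reinforce_update(log_probs, rewards, gamma=0.99):
        """log_probs: list of log pi(a_t | s_t) tensors collected during one episode."""
        returns, G = [], 0.0
        for r in reversed(rewards):            # discounted return from each step onwards
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize as a simple baseline
        loss = -(torch.stack(log_probs) * returns).sum()               # policy gradient loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()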

Chapter 9, Capstone Project – Playing Flappy Bird with DQN, takes us through a capstone project – playing Flappy Bird using reinforcement learning. We will apply what we have learned throughout this book to build an intelligent bot. We will focus on building a DQN, fine-tuning model parameters, and deploying the model. Let's see how long the bird can fly in the air.