TensorFlow 2 Reinforcement Learning Cookbook

By : Palanisamy P
Overview of this book

With deep reinforcement learning, you can build intelligent agents, products, and services that go beyond computer vision or perception to perform actions. TensorFlow 2.x is the latest major release of the most popular deep learning framework used to develop and train deep neural networks (DNNs). This book contains easy-to-follow recipes for leveraging TensorFlow 2.x to develop artificial intelligence applications. Starting with an introduction to the fundamentals of deep reinforcement learning and TensorFlow 2.x, the book covers OpenAI Gym, model-based RL, model-free RL, and how to develop basic agents. You'll discover how to implement advanced deep reinforcement learning algorithms such as actor-critic, deep deterministic policy gradients, deep Q-networks, proximal policy optimization, and deep recurrent Q-networks for training your RL agents. As you advance, you'll explore the applications of reinforcement learning by building cryptocurrency trading agents, stock/share trading agents, and intelligent agents for automating task completion. Finally, you'll find out how to deploy deep reinforcement learning agents to the cloud and build cross-platform apps using TensorFlow 2.x. By the end of this TensorFlow book, you'll have gained a solid understanding of deep reinforcement learning algorithms and their implementations from scratch.
Table of Contents (11 chapters)

Implementing actor-critic RL algorithms

Actor-critic algorithms let us combine value-based and policy-based reinforcement learning in a single agent. Policy gradient methods search and optimize the policy directly in policy space, which yields smoother learning curves and improvement guarantees, but they tend to get stuck in local maxima of the long-term reward objective. Value-based methods do not get stuck at local optima, but they lack convergence guarantees, and algorithms such as Q-learning tend to have high variance and are not very sample-efficient. Actor-critic methods combine the strengths of both families: the actor learns the policy while the critic's value estimates reduce the variance of the policy updates, which also makes actor-critic methods more sample-efficient. This recipe will make it easy for you to implement an actor-critic-based RL agent using TensorFlow 2.x. Upon completing this recipe, you will be able to train the actor-critic agent in any OpenAI Gym-compatible reinforcement learning...
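To make the actor/critic split concrete before the TensorFlow recipe, here is a minimal framework-free sketch of the core update rules in NumPy. The toy two-state environment, the learning rates, and the tabular parameterization are all illustrative assumptions, not the book's setup; the recipe itself replaces the tables with TensorFlow 2.x neural networks. The key idea is the same: the critic learns state values via the TD error, and the actor takes a policy-gradient step scaled by that same TD error.

```python
import numpy as np

np.random.seed(0)
n_states, n_actions = 2, 2
gamma, alpha_actor, alpha_critic = 0.9, 0.1, 0.2

theta = np.zeros((n_states, n_actions))  # actor: per-state policy logits
V = np.zeros(n_states)                   # critic: state-value estimates

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def env_step(s, a):
    # Hypothetical toy dynamics: action 1 taken in state 1 is rewarded;
    # the chosen action deterministically selects the next state.
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return a, reward

s = 0
for _ in range(2000):
    probs = softmax(theta[s])
    a = np.random.choice(n_actions, p=probs)
    next_s, r = env_step(s, a)

    # Critic update: one-step TD error and value correction
    td_error = r + gamma * V[next_s] - V[s]
    V[s] += alpha_critic * td_error

    # Actor update: policy-gradient step weighted by the TD error,
    # using grad log pi(a|s) = onehot(a) - probs for a softmax policy
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi

    s = next_s

print("P(action 1 | state 1) =", softmax(theta[1])[1])
```

After training, the learned policy should strongly prefer the rewarding action in state 1; the TF2 agent in this recipe follows the same loop, with gradient-tape updates on the actor and critic networks in place of the tabular increments.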