Reinforcement learning is a branch of artificial intelligence that deals with an agent that perceives the information of the environment in the form of state spaces and action spaces, and acts on the environment thereby resulting in a new state and receiving a reward as feedback for that action. This received reward is assigned to the new state. Just like when we had to minimize the cost function in order to train our neural network, here the reinforcement learning agent has to maximize the overall reward to find the the optimal policy to solve a particular task.

How this is different from supervised and unsupervised learning?

In supervised learning, the training dataset has input features, *X*, and their corresponding output labels, *Y*. A model is trained on this training dataset, to which test cases having input features, *X'*, are given as the input and the model predicts *Y'*.

In unsupervised learning, input features, *X*, of the training set are given for the training purpose. There are no associated *Y* values. The goal is to create a model that learns to segregate the data into different clusters by understanding the underlying pattern and thereby, classifying them to find some utility. This model is then further used for the input features *X'* to predict their similarity to one of the clusters.

Reinforcement learning is different from both supervised and unsupervised. Reinforcement learning can guide an agent on how to act in the real world. The interface is broader than the training vectors, like in supervised or unsupervised learning. Here is the entire environment, which can be real or a simulated world. Agents are trained in a different way, where the objective is to reach a goal state, unlike the case of supervised learning where the objective is to maximize the likelihood or minimize cost.

Reinforcement learning agents automatically receive the feedback, that is, rewards from the environment, unlike in supervised learning where labeling requires time-consuming human effort. One of the bigger advantage of reinforcement learning is that phrasing any task's objective in the form of a goal helps in solving a wide variety of problems. For example, the goal of a video game agent would be to win the game by achieving the highest score. This also helps in discovering new approaches to achieving the goal. For example, when AlphaGo became the world champion in Go, it found new, unique ways of winning.

A reinforcement learning agent is like a human. Humans evolved very slowly; an agent reinforces, but it can do that very fast. As far as sensing the environment is concerned, neither humans nor and artificial intelligence agents can sense the entire world at once. The perceived environment creates a state in which agents perform actions and land in a new state, that is, a newly-perceived environment different from the earlier one. This creates a state space that can be finite as well as infinite.

The largest sector interested in this technology is defense. Can reinforcement learning agents replace soldiers that not only walk, but fight, and make important decisions?

The following are the basic terminologies associated with reinforcement learning:

**Agent**: This we create by programming such that it is able to sense the environment, perform actions, receive feedback, and try to maximize rewards.**Environment**: The world where the agent resides. It can be real or simulated.**State**: The perception or configuration of the environment that the agent senses. State spaces can be finite or infinite.**Rewards**: Feedback the agent receives after any action it has taken. The goal of the agent is to maximize the overall reward, that is, the immediate and the future reward. Rewards are defined in advance. Therefore, they must be created properly to achieve the goal efficiently.**Actions**: Anything that the agent is capable of doing in the given environment. Action space can be finite or infinite.**SAR triple**: (state, action, reward) is referred as the SAR triple, represented as (s, a, r).**Episode**: Represents one complete run of the whole task.

Let's deduce the convention shown in the following diagram:

Every task is a sequence of SAR triples. We start from state S(t), perform action A(t) and thereby, receive a reward R(t+1), and land on a new state *S(t+1)*. The current state and action pair gives rewards for the next step. Since, *S(t)* and *A(t)* results in *S(t+1)*, we have a new triple of (current state, action, new state), that is, *[S(t),A(t),S(t+1)] or (s,a,s')*.

The optimality criteria are a measure of goodness of fit of the model created over the data. For example, in supervised classification learning algorithms, we have maximum likelihood as the optimality criteria. Thus, on the basis of the problem statement and objective optimality criteria differs. In reinforcement learning, our major goal is to maximize the future rewards. Therefore, we have two different optimality criteria, which are:

**Value function**: To quantify a state on the basis of future probable rewards**Policy**: To guide an agent on what action to take in a given state

We will discuss both of them in detail in the coming topics.

Agents should be able to think about both immediate and future rewards. Therefore, a value is assigned to each encountered state that reflects this future information too. This is called value function. Here comes the concept of delayed rewards, where being at present what actions taken now will lead to potential rewards in future.

V(s), that is, value of the state is defined as the expected value of rewards to be received in future for all the actions taken from this state to subsequent states until the agent reaches the goal state. Basically, value functions tell us how good it is to be in this state. The higher the value, the better the state.

Rewards assigned to each (s,a,s') triple is fixed. This is not the case with the value of the state; it is subjected to change with every action in the episode and with different episodes too.

One solution comes in mind, instead of the value function, why don't we store the knowledge of every possible state?

The answer is simple: it's time-consuming and expensive, and this cost grows exponentially. Therefore, it's better to store the knowledge of the current state, that is, V(s):

*V(s) = E[all future rewards discounted | S(t)=s]*

More details on the value function will be covered in Chapter 3, *The Markov Decision Process and Partially Observable MDP*.

Policy is defined as the model that guides the agent with action selection in different states. Policy is denoted as

.

is basically the probability of a certain action given a particular state:

Thus, a policy map will provide the set of probabilities of different actions given a particular state. The policy along with the value function create a solution that helps in agent navigation as per the policy and the calculated value of the state.

Q-learning is an attempt to learn the value *Q(s,a)* of a specific action given to the agent in a particular state. Consider a table where the number of rows represent the number of states, and the number of columns represent the number of actions. This is called a Q-table. Thus, we have to learn the value to find which action is the best for the agent in a given state.

Steps involved in Q-learning:

Initialize the table of

*Q(s,a)*with uniform values (say, all zeros).Observe the current state,

*s*Choose an action,

*a*, by epsilon greedy or any other action selection policies, and take the actionAs a result,

*a*reward,*r*, is received and a new state,*s'*, is perceivedUpdate the

*Q*value of the*(s,a)*pair in the table by using the following Bellman equation:

,where

is the discounting factor

Then, set the value of current state as a new state and repeat the process to complete one episode, that is, reaches the terminal state

Run multiple episodes to train the agent

To simplify, we can say that the Q-value for a given state, *s*, and action, *a*, is updated by the sum of current reward, *r*, and the discounted (

) maximum *Q* value for the new state among all its actions. The discount factor delays the reward from the future compared to the present rewards. For example, a reward of 100 today will be worth more than 100 in the future. Similarly, a reward of 100 in the future must be worth less than 100 today. Therefore, we will discount the future rewards. Repeating this update process continuously results in Q-table values converging to accurate measures of the expected future reward for a given action in a given state.

When the volume of the state and action spaces increase, maintaining a Q-table is difficult. In the real world, the state spaces are infinitely large. Thus, there's a requirement of another approach that can produce *Q(s,a)* without a Q-table. One solution is to replace the Q-table with a function. This function will take the state as the input in the form of a vector, and output the vector of Q-values for all the actions in the given state. This function approximator can be represented by a neural network to predict the Q-values. Thus, we can add more layers and fit in a deep neural network for better prediction of Q-values when the state and action space becomes large, which seemed impossible with a Q-table. This gives rise to the Q-network and if a deeper neural network, such as a convolutional neural network, is used then it results in a **deep Q-network**(**DQN**).

More details on Q-learning and deep Q-networks will be covered inChapter 5, *Q-Learning and Deep Q-Networks*.

The A3C algorithm was published in June 2016 by the combined team of Google DeepMind and MILA. It is simpler and has a lighter framework that used the asynchronous gradient descent to optimize the deep neural network. It was faster and was able to show good results on the multi-core CPU instead of GPU. One of A3C's big advantages is that it can work on continuous as well as discrete action spaces. As a result, it has opened the gateway for many new challenging problems that have complex state and action spaces.

We will discuss it at a high note here, but we will dig deeper in Chapter 6, *Asynchronous Methods*. Let's start with the name, that is, **asynchronous advantage actor-critic** (**A3C**) algorithm and unpack it to get the basic overview of the algorithm:

**Asynchronous**: In DQN, you remember we used a neural network with our agent to predict actions. This means there is one agent and it's interacting with one environment. What A3C does is create multiple copies of the agent-environment to make the agent learn more efficiently. A3C has a global network, and multiple worker agents, where each agent has its own set of network parameters and each of them interact with their copy of the environment simultaneously without interacting with another agent's environment. The reason this works better than a single agent is that the experience of each agent is independent of the experience of the other agents. Thus, the overall experience from all the worker agents results in diverse training.**Actor-critic**: Actor-critic combines the benefits of both value iteration and policy iteration. Thus, the network will estimate both a value function, V(s), and a policy, π(s), for a given state,. There will be two separate fully-connected layers at the top of the function approximator neural network that will output the value and policy of the state, respectively. The agent uses the value, which acts as a critic to update the policy, that is, the intelligent actor.`s`

**Advantage**: Policy gradients used discounted returns telling the agent whether the action was good or bad. Replacing that with Advantage not only quantifies the the good or bad status of the action but helps in encouraging and discouraging actions better(we will discuss this in Chapter 4,*Policy Gradients*).