# Reinforcement learning

**Reinforcement learning** is a branch of machine learning that enables an agent to maximize some form of cumulative reward within a specific context by taking specific actions. It differs from both supervised and unsupervised learning. Reinforcement learning is used extensively in game theory, control systems, robotics, and other emerging areas of artificial intelligence. The following diagram illustrates the interaction between an agent and an environment in a reinforcement learning problem:

# Q-learning

We will now look at a popular reinforcement learning algorithm, called **Q-learning**. Q-learning is used to determine an optimal action selection policy for a given finite Markov decision process. A **Markov decision process** is defined by a state space, *S*; an action space, *A*; an immediate rewards set, *R*; a probability of the next state, *S^{(t+1)}*, given the current state, *S^{(t)}*, and the current action, *a^{(t)}*, of the form *P(S^{(t+1)}/S^{(t)}; a^{(t)})*; and a discount factor, *γ*. The following diagram illustrates a Markov decision process, where the next state is dependent on the current state and any actions taken in the current state:

Let's suppose that we have a sequence of states, actions, and corresponding rewards, as follows:
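The illustration of this sequence did not survive extraction; a reconstruction consistent with the notation used in this section would be:

```latex
s^{(t)}, a^{(t)}, r^{(t)},\; s^{(t+1)}, a^{(t+1)}, r^{(t+1)},\; s^{(t+2)}, a^{(t+2)}, r^{(t+2)}, \ldots
```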

If we consider the long-term reward, *R_{t}*, at step *t*, it is equal to the sum of the immediate rewards at each step, from *t* until the end, as follows:
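The equation itself is missing from this extraction; under the notation above, it is the undiscounted sum of immediate rewards up to the final step *T*:

```latex
R_{t} = r^{(t)} + r^{(t+1)} + \cdots + r^{(T)} = \sum_{k=0}^{T-t} r^{(t+k)}
```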

Now, a Markov decision process is a random process, and it is not possible to get the same next state, *S^{(t+1)}*, based on *S^{(t)}* and *a^{(t)}* every time; so, we apply a discount factor, *γ*, to future rewards. This means that the long-term reward can be better represented as follows:
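The discounted form of the long-term reward (the original equation image is missing here) is, under standard conventions:

```latex
R_{t} = r^{(t)} + \gamma\, r^{(t+1)} + \gamma^{2}\, r^{(t+2)} + \cdots = \sum_{k=0}^{\infty} \gamma^{k}\, r^{(t+k)}
```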

Since, at time step *t*, the immediate reward is already realized, to maximize the long-term reward, we need to maximize the long-term reward at time step *t+1* (that is, *R_{t+1}*), by choosing an optimal action. The maximum long-term reward expected at a state *S^{(t)}* by taking an action *a^{(t)}* is represented by the following Q-function:
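The Q-function image is missing from this extraction; the standard Bellman form, consistent with the description that follows, is:

```latex
Q\left(s^{(t)}, a^{(t)}\right) = \mathbb{E}\left[ r^{(t)} + \gamma \max_{a'} Q\left(s^{(t+1)}, a'\right) \right]
```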

At each state, *s ∈ S*, the agent in Q-learning tries to take an action, *a ∈ A*, that maximizes its long-term reward. The Q-learning algorithm is an iterative process, the update rule of which is as follows:
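The update rule image did not survive extraction; the standard Q-learning update, with learning rate *α*, is:

```latex
Q\left(s^{(t)}, a^{(t)}\right) \leftarrow (1 - \alpha)\, Q\left(s^{(t)}, a^{(t)}\right) + \alpha \left( r^{(t)} + \gamma \max_{a'} Q\left(s^{(t+1)}, a'\right) \right)
```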

As you can see, the algorithm is inspired by the notion of a long-term reward, as expressed in *(1)*.
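This iterative process can be sketched in code. The following is a minimal tabular Q-learning example on a toy one-dimensional corridor environment; the environment, its size, and the hyperparameter values (α = 0.5, γ = 0.9, ε = 0.1) are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Toy 1-D corridor: states 0..4, actions 0 = left, 1 = right.
# Reaching the rightmost state yields a reward of 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2
GOAL = N_STATES - 1
alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount, exploration

def step(state, action):
    """Deterministic transition for the toy environment."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))

for _ in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy selection; also randomize when the Q-values tie,
        # so the untrained agent explores instead of always picking action 0.
        if rng.random() < eps or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Weighted average of the old estimate and the new long-term reward.
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next

policy = np.argmax(Q, axis=1)
print(policy)  # the learned greedy policy for each state
```

After training, the greedy policy moves right toward the goal from every non-terminal state, and the Q-values approximate the discounted distance to the reward.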

The overall cumulative reward, *Q(s^{(t)}, a^{(t)})*, of taking action *a^{(t)}* in state *s^{(t)}* is dependent on the immediate reward, *r^{(t)}*, and the maximum long-term reward that we can hope for at the new state, *s^{(t+1)}*. In a Markov decision process, the new state *s^{(t+1)}* is stochastically dependent on the current state, *s^{(t)}*, and the action taken, *a^{(t)}*, through a probability density/mass function of the form *P(S^{(t+1)}/S^{(t)}; a^{(t)})*.

The algorithm keeps on updating the expected long-term cumulative reward by taking a weighted average of the old expectation and the new long-term reward, based on the value of the learning rate, *α*.

Once we have built the *Q(s, a)* function through the iterative algorithm, while playing the game based on a given state, *s*, we can take the best action, *a\**, as the policy that maximizes the Q-function:
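The policy equation image is missing from this extraction; the greedy policy it describes is the standard argmax over actions:

```latex
a^{*} = \operatorname*{arg\,max}_{a \in A} Q(s, a)
```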

# Deep Q-learning

In Q-learning, we generally work with a finite set of states and actions; this means that tables suffice to hold the Q-values and rewards. However, in practical applications, the number of states and applicable actions can be very large or even infinite, and better Q-function approximators are needed to represent and learn the Q-function. This is where deep neural networks come to the rescue, since they are universal function approximators. We can represent the Q-function with a neural network that takes the states and actions as input and provides the corresponding Q-values as output. Alternatively, we can train a neural network using only the states as input, with the output being the Q-values corresponding to all of the actions. Both of these scenarios are illustrated in the following diagram. Since the Q-values are rewards, we are dealing with regression in these networks:
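As a concrete sketch of the second architecture (state as input, one Q-value per action as output), here is a tiny forward pass written in plain NumPy; the layer sizes, random weights, and the use of NumPy rather than a deep learning framework are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2  # illustrative sizes

# Randomly initialized two-layer MLP; in practice these weights would be
# trained with a regression loss on Q-value targets.
W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """Map a state vector to one Q-value per action.
    The output layer is linear, since Q-values are unbounded regression
    targets (as the text notes, this is a regression problem)."""
    hidden = np.maximum(0.0, state @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2

state = rng.normal(size=STATE_DIM)
q = q_values(state)
best_action = int(np.argmax(q))  # greedy action from the predicted Q-values
print(q.shape, best_action)
```

A single forward pass over the state yields the Q-values for every action at once, so greedy action selection is just an argmax over the output vector, which is what makes this architecture convenient for deep Q-learning.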

In this book, we will use reinforcement learning to train a race car to drive by itself through deep Q-learning.