Keras Reinforcement Learning Projects

By: Giuseppe Ciaburro

Getting started with reinforcement learning

Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli that depend on the actions chosen by the agent. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course.

These mechanisms derive from the basic concepts of machine learning (learning from experience), in an attempt to simulate human behavior. In fact, in our mind, we activate brain mechanisms that lead us to chase and repeat what, in us, produces feelings of gratification and wellbeing. Whenever we experience moments of pleasure (food, sex, love, and so on), some substances are produced in our brains that work by reinforcing that same stimulus, emphasizing it.

Along with this mechanism of neurochemical reinforcement, memory plays an important role. In fact, memory stores the subject's experience so that it can be repeated in the future. Evolution has endowed us with this mechanism to push us to repeat gratifying experiences, in the direction of the best solutions.

This is why we remember the most important experiences of our lives so vividly: experiences, especially the most rewarding ones, are impressed in memory and condition our future explorations. Previously, we saw that learning from experience can be simulated by a numerical algorithm in various ways, depending on the nature of the signal used for learning and the type of feedback adopted by the system.

The following diagram shows a flowchart that displays an agent's interaction with the environment in a reinforcement learning setting:

Scientific literature has taken an uncertain stance on the classification of reinforcement learning as a paradigm. Initially, it was considered a special case of supervised learning, after which it was fully promoted to the third paradigm of machine learning algorithms. It is applied in contexts in which supervised learning is inefficient: problems of interaction with the environment are a clear example.

The following list shows the steps to follow to correctly apply a reinforcement learning algorithm (a minimal code sketch of this loop follows the list):

  1. Preparation of the agent
  2. Observation of the environment
  3. Selection of the optimal strategy
  4. Execution of actions
  5. Calculation of the corresponding reward (or penalty)
  6. Development of updating strategies (if necessary)
  7. Repetition of steps 2 through 5 iteratively until the agent learns the optimal strategies
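
The same loop can be sketched in just a few lines of Python. The env and agent objects and their methods (reset, step, select_action, update) are illustrative assumptions, not the interface of any particular library:

# Minimal sketch of the interaction loop described in the preceding list
# (the interface names are assumptions, not a specific library's API).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                  # step 2: observe the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)              # step 3: select the strategy/action
        next_state, reward, done = env.step(action)      # steps 4-5: act and receive the reward
        agent.update(state, action, reward, next_state)  # step 6: update the strategy if necessary
        total_reward += reward
        state = next_state
        if done:                                         # goal reached or episode over
            break
    return total_reward

# Step 7: repeating run_episode many times lets the agent gradually learn
# a good strategy from its accumulated experience.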

Reinforcement learning is based on a theory from psychology, elaborated following a series of experiments performed on animals. In particular, Edward Thorndike (American psychologist) noted that if a cat is given a reward immediately after the execution of a behavior considered correct, then this increases the probability that this behavior will repeat itself. On the other hand, in the face of unwanted behavior, the application of a punishment decreases the probability of a repetition of the error.

On the basis of this theory, after defining a goal to be achieved, reinforcement learning tries to maximize the rewards received for the execution of the action, or set of actions, that allows the agent to reach the designated goal.

Agent-environment interface

Reinforcement learning can be seen as a special case of the interaction problem, in terms of achieving a goal. The entity that must reach the goal is called an agent. The entity with which the agent must interact is called the environment, which corresponds to everything that is external to the agent.

So far, we have focused on the term agent, but what does it represent? A software agent is a software entity that performs services on behalf of another program, usually automatically and invisibly. These pieces of software are also called smart agents.

What follows is a list of the most important features of an agent:

  • It can choose from a continuous or a discrete set of actions to perform on the environment.
  • The action depends on the situation, which is summarized in the system state.
  • The agent continuously monitors the environment (input) and continuously changes the state.
  • The choice of action is not trivial and requires a certain degree of intelligence.
  • The agent has a smart memory.

The agent has a goal-directed behavior, but acts in an uncertain environment that is not known a priori or only partially known. An agent learns by interacting with the environment. Planning can be developed while learning about the environment through the measurements made by the agent itself. This strategy is close to trial-and-error theory.

Trial and error is a fundamental method of problem solving. It is characterized by repeated, varied attempts that are continued until success, or until the agent stops trying.
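
As a toy illustration of trial and error with memory, the following sketch shows an agent that makes varied, random attempts and remembers, for each state, the most rewarding action it has found so far. The class and method names are assumptions made for this example:

import random

class TrialAndErrorAgent:
    """Toy agent: try actions at random and remember the best one per state."""

    def __init__(self, actions):
        self.actions = actions
        self.memory = {}  # state -> (best_action, best_reward) observed so far

    def select_action(self, state, try_new_prob=0.5):
        # Either repeat the most rewarding action remembered for this state,
        # or make a new, varied attempt (trial and error).
        if state not in self.memory or random.random() < try_new_prob:
            return random.choice(self.actions)
        return self.memory[state][0]

    def remember(self, state, action, reward):
        # Keep the action only if it beats what is already stored in memory.
        best = self.memory.get(state)
        if best is None or reward > best[1]:
            self.memory[state] = (action, reward)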

The agent-environment interaction is continuous: the agent chooses an action to be taken, and in response, the environment changes state, presenting a new situation to be faced.

In the particular case of reinforcement learning, the environment provides the agent with a reward. It is essential that the source of the reward is the environment to avoid the formation, within the agent, of a personal reinforcement mechanism that would compromise learning.

The value of the reward is proportional to the influence that the action has in reaching the objective, so it is positive or high in the case of a correct action, or negative or low for an incorrect action.

The following list gives some real-life examples in which an agent interacts with an environment to solve a problem:

  • A chess player, for each move, has information on the configurations of pieces that can be created, and on the possible countermoves of the opponent.
  • A little giraffe, in just a few hours, learns to get up and run.
  • A truly autonomous robot, for example a Roomba robot vacuum, learns to move around a room in order to get out of it.
  • The parameters of a refinery (oil pressure, flow, and so on) are set in real time so as to obtain the maximum yield or maximum quality. For example, if particularly dense oil arrives, then the flow rate to the plant is modified to allow adequate refining.

All the examples that we examined have the following characteristics in common:

  • Interaction with the environment
  • A specific goal that the agent wants to achieve
  • Uncertainty or partial knowledge of the environment

From the analysis of these examples, it is possible to make the following observations:

  • The agent learns from its own experience.
  • Actions change the state (the situation), and so the choices available in the future change (delayed reward).
  • The effect of an action cannot be completely predicted.
  • The agent has a global assessment of its behavior.
  • It must exploit this information to improve its choices. Choices improve with experience.
  • Problems can have a finite or infinite time horizon.

Essentially, the agent receives sensations from the environment through its sensors. Depending on these sensations, the agent decides what actions to take in the environment. Based on the immediate result of its actions, the agent can be rewarded.

If you want to use an automatic learning method, you need to give a formal description of the environment. It is not important to know exactly how the environment is made; what is interesting is to make general assumptions about the properties that the environment has. In reinforcement learning, it is usually assumed that the environment can be described by an MDP (Markov Decision Process).

Markov Decision Process

To avoid excessive computational load and difficulties, the agent-environment interaction is modeled as an MDP. An MDP is a discrete-time stochastic control process.

Stochastic processes are mathematical models used to study the evolution of phenomena that follow random or probabilistic laws. It is known that in all natural phenomena, both by their very nature and because of observational errors, a random or accidental component is present. This component causes the following: at every instant t, the result of observing the phenomenon is a random variable s_t. It is not possible to predict with certainty what the result will be; one can only state that it will take one of several possible values, each of which has a given probability.

A stochastic process is called Markovian when, having chosen a certain instant t for observation, the evolution of the process from t onward depends only on t and does not depend in any way on previous instants. Thus, a process is Markovian when, given the moment of observation, only this instant determines the future evolution of the process, while this evolution does not depend on the past.

In a Markov process, at each time step, the process is in a state s ∈ S, and the decision maker may choose any action a ∈ A that is available in state s. The process responds at the next time step by randomly moving into a new state s', and giving the decision maker a corresponding reward r(s, s').
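
As a concrete (toy) illustration, an MDP can be written out explicitly as a table of transition probabilities and rewards. The two states, two actions, and all the numbers that follow are invented for the example:

import random

# A toy MDP written out explicitly (states, actions, and numbers are invented).
# P[s][a] is a list of (probability, next_state, reward) triples: from state s,
# action a moves the process randomly to s' and yields a reward r(s, s').
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.5)],
        "move": [(0.9, "s0", 0.0), (0.1, "s1", 0.5)],
    },
}

def step(state, action):
    """Sample the next state and reward according to the transition table."""
    transitions = P[state][action]
    probs = [p for p, _, _ in transitions]
    _, next_state, reward = random.choices(transitions, weights=probs, k=1)[0]
    return next_state, reward

print(step("s0", "move"))  # e.g. ('s1', 1.0)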

The following diagram shows the agent-environment interaction in an MDP:

The agent-environment interaction shown in the preceding diagram can be schematized as follows:

  • The agent and the environment interact at discrete time steps, t = 0, 1, 2, ..., n.
  • At each time step, the agent receives a representation of the state s_t of the environment.
  • Each element s_t ∈ S, where S is the set of possible states.
  • Once the state is recognized, the agent must take an action a_t ∈ A(s_t), where A(s_t) is the set of possible actions in the state s_t.
  • The choice of the action to be taken depends on the objective to be achieved and is mapped through the policy, indicated with the symbol π, which associates an action a_t ∈ A(s) with each state s. The term π_t(s, a) represents the probability that action a is carried out in the state s.
  • During the next time step, t + 1, as a consequence of the action a_t, the agent receives a numerical reward r_{t+1} ∈ R corresponding to the action previously taken, a_t.
  • The consequence of the action also determines the new state, s_{t+1}. At this point, the agent must again encode the state and choose the next action.
  • This iteration repeats itself until the agent achieves its objective.

The definition of the state s_{t+1} depends only on the previous state and the action taken (the Markov property), that is as follows:

s_{t+1} = δ(s_t, a_t)

Here, δ represents the state transition function.

In summary:

  • In an MDP, the agent can perceive the state s ∈ S in which it finds itself and has a set A of actions at its disposal
  • At each discrete time step t, the agent detects the current state s_t and decides to carry out an action a_t ∈ A
  • The environment responds by providing a reward (a reinforcement) r_t = r(s_t, a_t) and by moving into the state s_{t+1} = δ(s_t, a_t)
  • The r and δ functions are part of the environment; they depend only on the current state and action (not on previous ones) and are not necessarily known to the agent (a minimal sketch of these two functions follows this list)
  • The goal of reinforcement learning is to learn a policy that, for each state s in which the system finds itself, indicates to the agent an action that maximizes the total reinforcement received during the entire action sequence
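
As mentioned in the list above, here is a minimal sketch of the δ and r functions for a deterministic toy environment: a short corridor of five states with the goal at the right end. All concrete choices are assumptions made for the example:

# Deterministic toy environment: states 0..4, actions -1 (left) and +1 (right).
# delta(s, a) is the next-state function, r(s, a) the reward function; both
# belong to the environment and need not be known to the agent in advance.
GOAL = 4

def delta(state, action):
    return max(0, min(GOAL, state + action))

def r(state, action):
    # Reward 1 only when the action brings the agent to the goal state.
    return 1.0 if delta(state, action) == GOAL else 0.0

# A policy maps every state to an action; "always move right" reaches the goal.
policy = {s: +1 for s in range(GOAL + 1)}

state, total = 0, 0.0
while state != GOAL:
    action = policy[state]
    total += r(state, action)
    state = delta(state, action)
print(state, total)  # 4 1.0: the goal is reached with a total reward of 1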

Let's go deeper into some of the terms used (a small code sketch of these objects follows the list):

  • A reward function defines the goal in a reinforcement learning problem. It maps each detected state of the environment into a single number, thereby defining a reward. As already mentioned, the agent's only goal is to maximize the total reward it receives in the long term. The reward function therefore defines what the good and bad events are for the agent. The reward function must necessarily be correct, and it can be used as a basis for changing the policy. For example, if an action selected by the policy is followed by a low reward, the policy can be changed to select a different action in that situation the next time.
  • A policy defines the behavior of the learning agent at a given time. It maps the detected states of the environment to the actions to take when in those states. This corresponds to what, in psychology, would be called a set of stimulus-response rules or associations. The policy is the fundamental part of a reinforcement learning agent, in the sense that it alone is enough to determine behavior.
  • A value function represents how good a state is for an agent. It is equal to the total reward expected by an agent starting from the state s. The value function depends on the policy with which the agent selects the actions to be performed.
  • An action-value function returns the value, that is, the expected return (overall reward), of using action a in a certain state s and then following a given policy.
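
For a small tabular problem, these four objects can be represented with plain Python dictionaries. The states, actions, and numbers below are placeholders invented for illustration; in practice the values would be learned, not written by hand:

# Tabular representation of the four objects described above
# (states, actions, and numbers are placeholders, not learned values).
states = [0, 1, 2]
actions = ["left", "right"]

# Reward function: a single number for each detected state.
reward = {0: 0.0, 1: 0.0, 2: 1.0}

# Policy: the action to take in each state (stimulus -> response).
policy = {0: "right", 1: "right", 2: "right"}

# Value function V(s): total reward expected starting from s and following the policy.
V = {s: 0.0 for s in states}

# Action-value function Q(s, a): expected return for taking a in s, then following the policy.
Q = {(s, a): 0.0 for s in states for a in actions}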

Discounted cumulative reward

In the previous section, we said that the goal of reinforcement learning is to learn a policy that, for each state s in which the system is located, indicates to the agent an action to maximize the total reward received during the entire action sequence. How can we maximize the total reinforcement received during the entire sequence of actions?

The total reinforcement derived from the policy is calculated as follows:

R_t = r_{t+1} + r_{t+2} + ... + r_T

Here, r_T represents the reward of the action that drives the environment into the terminal state s_T.

A possible solution to the problem is to associate the action that provides the highest reward to each individual state; that is, we must determine an optimal policy such that the previous quantity is maximized.

For problems that do not reach the goal or terminal state in a finite number of steps (continuing tasks), R_t tends to infinity.

In these cases, the sum of the rewards that one wants to maximize diverges to infinity, so this approach is not applicable. It is therefore necessary to develop an alternative reinforcement technique.

The technique that best suits the reinforcement learning paradigm turns out to be the discounted cumulative reward, which tries to maximize the following quantity:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...

Here, γ is called the discount factor and represents the importance of future rewards. This parameter can take values 0 ≤ γ ≤ 1, with the following behavior:

  • If γ < 1, the sum R_t converges to a finite value (as long as the rewards are bounded)
  • If γ = 0, the agent will have no interest in future rewards, but will try to maximize the reward only for the current state
  • If γ = 1, the agent will try to increase future rewards even at the expense of the immediate ones

The discount factor can be modified during the learning process to highlight particular actions or states. An optimal policy may accept that the reinforcement obtained by performing a single action is low (or even negative), provided that this leads to greater reinforcement overall.
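
A few lines of Python are enough to compute the discounted cumulative reward of a finite sequence of rewards and to see the effect of γ (the reward values used here are arbitrary):

def discounted_return(rewards, gamma=0.9):
    """Compute r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

rewards = [1.0, 1.0, 1.0]                     # arbitrary reward sequence
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9 + 0.81 ≈ 2.71
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward: 1.0
print(discounted_return(rewards, gamma=1.0))  # plain sum of all rewards: 3.0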

Exploration versus exploitation

Ideally, the agent should associate with each action a_t the respective reward r, in order to then choose the most rewarding behavior for achieving the goal. This approach is, however, impracticable for complex problems in which the number of states is particularly high and, consequently, the possible associations increase exponentially.

This problem is called the exploration-exploitation dilemma. Ideally, the agent should explore all possible actions for each state, finding the one that is actually the most rewarding, in order to exploit it to achieve its goal.

Thus, decision-making involves a fundamental choice:

  • Exploitation: Make the best decision, given current information
  • Exploration: Collect more information

In this process, the best long-term strategy can lead to considerable sacrifices in the short term. Therefore, it is necessary to gather enough information to make the best decisions.

The exploration-exploitation dilemma makes itself known whenever we try to learn something new. Often, we have to decide whether to choose what we already know (exploitation), leaving our cultural baggage unaltered, or choosing something new and learning more in this way (exploration). The second choice puts us at the risk of making the wrong choices. This is an experience that we often face; think, for example, about the choices we make in a restaurant when we are asked to choose between the dishes on the menu:

  • We can choose something that we already know and that, in the past, has given us back a known reward with gratification (exploitation), such as pizza (who does not know the goodness of a margherita pizza?)
  • We can try something new that we have never tasted before and see what we get (exploration), such as lasagna (alas, not everyone knows the magic taste of a plate of lasagna)

The choice we will make will depend on many boundary conditions: the price of the dishes, the level of hunger, knowledge of the dishes, and so on. What is important is that the study of the best way to make these kinds of choices has demonstrated that optimal learning sometimes requires us to make bad choices. This means that, sometimes, you have to choose to avoid the action you deem most rewarding and take an action that you feel is less rewarding. The logic is that these actions are necessary to obtain a long-term benefit: sometimes, you need to get your hands dirty to learn more.

The following are more examples of adopting this technique for real-life cases:

  • Selection of a store:
    • Exploitation: Go to your favorite store
    • Exploration: Try a new store
  • Choice of a route:
    • Exploitation: Choose the best route so far
    • Exploration: Try a new route

In practice, in very complex problems, convergence to a very good strategy would be too slow.

A good solution to the problem is to find a balance between exploration and exploitation:

  • An agent that limits itself to exploring will always act randomly in every state, and it is evident that convergence to an optimal strategy is impossible
  • An agent that explores too little will always use the usual actions, which may not be the optimal ones

Finally, we can say that at every step the agent has to choose between repeating what it has done so far, or trying out new movements that could achieve better results.
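
A common way to strike this balance is an ε-greedy rule: with a small probability ε the agent explores a random action, and otherwise it exploits the action with the best estimated reward. The following sketch uses the restaurant example; the reward estimates are invented numbers:

import random

def epsilon_greedy(estimates, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action.

    estimates maps each action to its currently estimated reward; in a real
    agent these estimates would be learned from experience.
    """
    if random.random() < epsilon:
        return random.choice(list(estimates))  # exploration: gather more information
    return max(estimates, key=estimates.get)   # exploitation: best current estimate

# Invented estimates for the restaurant example above.
dishes = {"pizza": 0.8, "lasagna": 0.5}
print(epsilon_greedy(dishes, epsilon=0.2))     # usually 'pizza', sometimes 'lasagna'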