Keras Reinforcement Learning Projects

By: Giuseppe Ciaburro

Getting started with reinforcement learning

Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli that depend on the actions chosen by the agent. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course.

These mechanisms derive from the basic concepts of machine learning (learning from experience), in an attempt to simulate human behavior. In fact, in our mind, we activate brain mechanisms that lead us to chase and repeat what, in us, produces feelings of gratification and wellbeing. Whenever we experience moments of pleasure (food, sex, love, and so on), some substances are produced in our brains that work by reinforcing that same stimulus, emphasizing it.

Along with this mechanism of neurochemical reinforcement, memory plays an important role. In fact, memory stores the subject's experience so that it can be repeated in the future. Evolution has endowed us with this mechanism to push us to repeat gratifying experiences, in the direction of the best solutions.

This is why we remember the most important experiences of our lives so vividly: experiences, especially the most rewarding ones, are impressed in memory and condition our future explorations. Previously, we saw that learning from experience can be simulated by a numerical algorithm in various ways, depending on the nature of the signal used for learning and the type of feedback adopted by the system.

The following diagram shows a flowchart that displays an agent's interaction with the environment in a reinforcement learning setting:

Scientific literature has taken an uncertain stance on the classification of reinforcement learning as a paradigm. Initially, it was considered a special case of supervised learning, after which it was fully promoted to the third paradigm of machine learning algorithms. It is applied in contexts in which supervised learning is inefficient: problems of interaction with the environment are a clear example.

The following list shows the steps to follow to correctly apply a reinforcement learning algorithm (a minimal code sketch of this loop follows the list):

  1. Preparation of the agent
  2. Observation of the environment
  3. Selection of the optimal strategy
  4. Execution of actions
  5. Calculation of the corresponding reward (or penalty)
  6. Development of updating strategies (if necessary)
  7. Repetition of steps 2 through 5 iteratively until the agent learns the optimal strategies
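
The same loop can be sketched in just a few lines of Python. The env and agent objects and their methods (reset, step, select_action, update) are illustrative assumptions, not the interface of any particular library:

# Minimal sketch of the interaction loop described in the preceding list
# (the interface names are assumptions, not a specific library's API).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                  # step 2: observe the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)              # step 3: select the strategy/action
        next_state, reward, done = env.step(action)      # steps 4-5: act and receive the reward
        agent.update(state, action, reward, next_state)  # step 6: update the strategy if necessary
        total_reward += reward
        state = next_state
        if done:                                         # goal reached or episode over
            break
    return total_reward

# Step 7: repeating run_episode many times lets the agent gradually learn
# a good strategy from its accumulated experience.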

Reinforcement learning is based on a theory from psychology, elaborated following a series of experiments performed on animals. In particular, Edward Thorndike (American psychologist) noted that if a cat is given a reward immediately after the execution of a behavior considered correct, then this increases the probability that this behavior will repeat itself. On the other hand, in the face of unwanted behavior, the application of a punishment decreases the probability of a repetition of the error.

On the basis of this theory, after defining a goal to be achieved, reinforcement learning tries to maximize the rewards received for the execution of the action, or set of actions, that allows the agent to reach the designated goal.

Agent-environment interface

Reinforcement learning can be seen as a special case of the interaction problem, in terms of achieving a goal. The entity that must reach the goal is called an agent. The entity with which the agent must interact is called the environment, which corresponds to everything that is external to the agent.

So far, we have focused on the term agent, but what does it represent? A software agent is a software entity that performs services on behalf of another program, usually automatically and invisibly. These pieces of software are also called smart agents.

What follows is a list of the most important features of an agent:

  • It can choose from a continuous or a discrete set of actions to perform on the environment.
  • The action depends on the situation, which is summarized in the system state.
  • The agent continuously monitors the environment (input) and continuously changes the state.
  • The choice of action is not trivial and requires a certain degree of intelligence.
  • The agent has a smart memory.

The agent has a goal-directed behavior, but acts in an uncertain environment that is not known a priori or only partially known. An agent learns by interacting with the environment. Planning can be developed while learning about the environment through the measurements made by the agent itself. This strategy is close to trial-and-error theory.

Trial and error is a fundamental method of problem solving. It is characterized by repeated, varied attempts that are continued until success, or until the agent stops trying.
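
As a toy illustration of trial and error with memory, the following sketch shows an agent that makes varied, random attempts and remembers, for each state, the most rewarding action it has found so far. The class and method names are assumptions made for this example:

import random

class TrialAndErrorAgent:
    """Toy agent: try actions at random and remember the best one per state."""

    def __init__(self, actions):
        self.actions = actions
        self.memory = {}  # state -> (best_action, best_reward) observed so far

    def select_action(self, state, try_new_prob=0.5):
        # Either repeat the most rewarding action remembered for this state,
        # or make a new, varied attempt (trial and error).
        if state not in self.memory or random.random() < try_new_prob:
            return random.choice(self.actions)
        return self.memory[state][0]

    def remember(self, state, action, reward):
        # Keep the action only if it beats what is already stored in memory.
        best = self.memory.get(state)
        if best is None or reward > best[1]:
            self.memory[state] = (action, reward)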

The agent-environment interaction is continuous: the agent chooses an action to be taken, and in response, the environment changes state, presenting a new situation to be faced.

In the particular case of reinforcement learning, the environment provides the agent with a reward. It is essential that the source of the reward is the environment to avoid the formation, within the agent, of a personal reinforcement mechanism that would compromise learning.

The value of the reward is proportional to the influence that the action has in reaching the objective, so it is positive or high in the case of a correct action, or negative or low for an incorrect action.

The following list gives some real-life examples in which an agent interacts with an environment to solve a problem:

  • A chess player, for each move, has information on the configurations of pieces that can be created, and on the possible countermoves of the opponent.
  • A little giraffe, in just a few hours, learns to get up and run.
  • A truly autonomous robot, for example a Roomba robot vacuum, learns to move around a room in order to get out of it.
  • The parameters of a refinery (oil pressure, flow, and so on) are set in real time so as to obtain the maximum yield or maximum quality. For example, if particularly dense oil arrives, then the flow rate to the plant is modified to allow adequate refining.

All the examples that we examined have the following characteristics in common:

  • Interaction with the environment
  • A specific goal that the agent wants to achieve
  • Uncertainty or partial knowledge of the environment

From the analysis of these examples, it is possible to make the following observations:

  • The agent learns from its own experience.
  • Actions change the state (the situation), and so the choices available in the future change (delayed reward).
  • The effect of an action cannot be completely predicted.
  • The agent has a global assessment of its behavior.
  • It must exploit this information to improve its choices. Choices improve with experience.
  • Problems can have a finite or infinite time horizon.

Essentially, the agent receives sensations from the environment through its sensors. Depending on these sensations, the agent decides what actions to take in the environment. Based on the immediate result of its actions, the agent can be rewarded.

If you want to use an automatic learning method, you need to give a formal description of the environment. It is not important to know exactly how the environment is made; what is interesting is to make general assumptions about the properties that the environment has. In reinforcement learning, it is usually assumed that the environment can be described by an MDP (Markov Decision Process).

Markov Decision Process

To avoid excessive computational load and difficulties, the agent-environment interaction is modeled as an MDP. An MDP is a discrete-time stochastic control process.

Stochastic processes are mathematical models used to study the evolution of phenomena that follow random or probabilistic laws. It is known that in all natural phenomena, both by their very nature and because of observational errors, a random or accidental component is present. This component causes the following: at every instant t, the result of observing the phenomenon is a random variable s_t. It is not possible to predict with certainty what the result will be; one can only state that it will take one of several possible values, each of which has a given probability.

A stochastic process is called Markovian when, having chosen a certain instant t for observation, the evolution of the process from t onward depends only on t and does not depend in any way on previous instants. Thus, a process is Markovian when, given the moment of observation, only this instant determines the future evolution of the process, while this evolution does not depend on the past.

In a Markov process, at each time step, the process is in a state s ∈ S, and the decision maker may choose any action a ∈ A that is available in state s. The process responds at the next time step by randomly moving into a new state s', and giving the decision maker a corresponding reward r(s, s').
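
As a concrete (toy) illustration, an MDP can be written out explicitly as a table of transition probabilities and rewards. The two states, two actions, and all the numbers that follow are invented for the example:

import random

# A toy MDP written out explicitly (states, actions, and numbers are invented).
# P[s][a] is a list of (probability, next_state, reward) triples: from state s,
# action a moves the process randomly to s' and yields a reward r(s, s').
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.5)],
        "move": [(0.9, "s0", 0.0), (0.1, "s1", 0.5)],
    },
}

def step(state, action):
    """Sample the next state and reward according to the transition table."""
    transitions = P[state][action]
    probs = [p for p, _, _ in transitions]
    _, next_state, reward = random.choices(transitions, weights=probs, k=1)[0]
    return next_state, reward

print(step("s0", "move"))  # e.g. ('s1', 1.0)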

The following diagram shows the agent-environment interaction in an MDP:

The agent-environment interaction shown in the preceding diagram can be schematized as follows:

  • The agent and the environment interact at discrete time steps, t = 0, 1, 2, ..., n.
  • At each time step, the agent receives a representation of the state s_t of the environment.
  • Each element s_t ∈ S, where S is the set of possible states.
  • Once the state is recognized, the agent must take an action a_t ∈ A(s_t), where A(s_t) is the set of possible actions in the state s_t.
  • The choice of the action to be taken depends on the objective to be achieved and is mapped through the policy, indicated with the symbol π, which associates an action a_t ∈ A(s) with each state s. The term π_t(s, a) represents the probability that action a is carried out in the state s.
  • During the next time step, t + 1, as a consequence of the action a_t, the agent receives a numerical reward r_{t+1} ∈ R corresponding to the action previously taken, a_t.
  • The consequence of the action also determines the new state, s_{t+1}. At this point, the agent must again encode the state and choose the next action.
  • This iteration repeats itself until the agent achieves its objective.

The definition of the state s_{t+1} depends only on the previous state and the action taken (the Markov property), that is as follows:

s_{t+1} = δ(s_t, a_t)

Here, δ represents the state transition function.

In summary:

  • In an MDP, the agent can perceive the state s ∈ S in which it finds itself and has a set A of actions at its disposal
  • At each discrete time step t, the agent detects the current state s_t and decides to carry out an action a_t ∈ A
  • The environment responds by providing a reward (a reinforcement) r_t = r(s_t, a_t) and by moving into the state s_{t+1} = δ(s_t, a_t)
  • The r and δ functions are part of the environment; they depend only on the current state and action (not on previous ones) and are not necessarily known to the agent (a minimal sketch of these two functions follows this list)
  • The goal of reinforcement learning is to learn a policy that, for each state s in which the system finds itself, indicates to the agent an action that maximizes the total reinforcement received during the entire action sequence
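
As mentioned in the list above, here is a minimal sketch of the δ and r functions for a deterministic toy environment: a short corridor of five states with the goal at the right end. All concrete choices are assumptions made for the example:

# Deterministic toy environment: states 0..4, actions -1 (left) and +1 (right).
# delta(s, a) is the next-state function, r(s, a) the reward function; both
# belong to the environment and need not be known to the agent in advance.
GOAL = 4

def delta(state, action):
    return max(0, min(GOAL, state + action))

def r(state, action):
    # Reward 1 only when the action brings the agent to the goal state.
    return 1.0 if delta(state, action) == GOAL else 0.0

# A policy maps every state to an action; "always move right" reaches the goal.
policy = {s: +1 for s in range(GOAL + 1)}

state, total = 0, 0.0
while state != GOAL:
    action = policy[state]
    total += r(state, action)
    state = delta(state, action)
print(state, total)  # 4 1.0: the goal is reached with a total reward of 1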

Let's go deeper into some of the terms used (a small code sketch of these objects follows the list):

  • A reward function defines the goal in a reinforcement learning problem. It maps each detected state of the environment into a single number, thereby defining a reward. As already mentioned, the agent's only goal is to maximize the total reward it receives in the long term. The reward function therefore defines what the good and bad events are for the agent. The reward function must necessarily be correct, and it can be used as a basis for changing the policy. For example, if an action selected by the policy is followed by a low reward, the policy can be changed to select a different action in that situation the next time.
  • A policy defines the behavior of the learning agent at a given time. It maps the detected states of the environment to the actions to take when in those states. This corresponds to what, in psychology, would be called a set of stimulus-response rules or associations. The policy is the fundamental part of a reinforcement learning agent, in the sense that it alone is enough to determine behavior.
  • A value function represents how good a state is for an agent. It is equal to the total reward expected by an agent starting from the state s. The value function depends on the policy with which the agent selects the actions to be performed.
  • An action-value function returns the value, that is, the expected return (overall reward), of using action a in a certain state s and then following a given policy.
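
For a small tabular problem, these four objects can be represented with plain Python dictionaries. The states, actions, and numbers below are placeholders invented for illustration; in practice the values would be learned, not written by hand:

# Tabular representation of the four objects described above
# (states, actions, and numbers are placeholders, not learned values).
states = [0, 1, 2]
actions = ["left", "right"]

# Reward function: a single number for each detected state.
reward = {0: 0.0, 1: 0.0, 2: 1.0}

# Policy: the action to take in each state (stimulus -> response).
policy = {0: "right", 1: "right", 2: "right"}

# Value function V(s): total reward expected starting from s and following the policy.
V = {s: 0.0 for s in states}

# Action-value function Q(s, a): expected return for taking a in s, then following the policy.
Q = {(s, a): 0.0 for s in states for a in actions}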

Discounted cumulative reward

In the previous section, we said that the goal of reinforcement learning is to learn a policy that, for each state s in which the system is located, indicates to the agent an action to maximize the total reward received during the entire action sequence. How can we maximize the total reinforcement received during the entire sequence of actions?

The total reinforcement derived from the policy is calculated as follows:

R_t = r_{t+1} + r_{t+2} + ... + r_T

Here, r_T represents the reward of the action that drives the environment into the terminal state s_T.

A possible solution to the problem is to associate the action that provides the highest reward to each individual state; that is, we must determine an optimal policy such that the previous quantity is maximized.

For problems that do not reach the goal or terminal state in a finite number of steps (continuing tasks), R_t tends to infinity.

In these cases, the sum of the rewards that one wants to maximize diverges to infinity, so this approach is not applicable. It is therefore necessary to develop an alternative reinforcement technique.

The technique that best suits the reinforcement learning paradigm turns out to be the discounted cumulative reward, which tries to maximize the following quantity:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...

Here, γ is called the discount factor and represents the importance of future rewards. This parameter can take values 0 ≤ γ ≤ 1, with the following behavior:

  • If γ < 1, the sum R_t converges to a finite value (as long as the rewards are bounded)
  • If γ = 0, the agent will have no interest in future rewards, but will try to maximize the reward only for the current state
  • If γ = 1, the agent will try to increase future rewards even at the expense of the immediate ones

The discount factor can be modified during the learning process to highlight particular actions or states. An optimal policy may accept that the reinforcement obtained by performing a single action is low (or even negative), provided that this leads to greater reinforcement overall.
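
A few lines of Python are enough to compute the discounted cumulative reward of a finite sequence of rewards and to see the effect of γ (the reward values used here are arbitrary):

def discounted_return(rewards, gamma=0.9):
    """Compute r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

rewards = [1.0, 1.0, 1.0]                     # arbitrary reward sequence
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9 + 0.81 ≈ 2.71
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward: 1.0
print(discounted_return(rewards, gamma=1.0))  # plain sum of all rewards: 3.0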

Exploration versus exploitation

Ideally, the agent should associate with each action a_t the respective reward r, in order to then choose the most rewarding behavior for achieving the goal. This approach is, however, impracticable for complex problems in which the number of states is particularly high and, consequently, the possible associations increase exponentially.

This problem is called the exploration-exploitation dilemma. Ideally, the agent should explore all possible actions for each state, finding the one that is actually the most rewarding, in order to exploit it to achieve its goal.

Thus, decision-making involves a fundamental choice:

  • Exploitation: Make the best decision, given current information
  • Exploration: Collect more information

In this process, the best long-term strategy can lead to considerable sacrifices in the short term. Therefore, it is necessary to gather enough information to make the best decisions.

The exploration-exploitation dilemma makes itself known whenever we try to learn something new. Often, we have to decide whether to choose what we already know (exploitation), leaving our cultural baggage unaltered, or choosing something new and learning more in this way (exploration). The second choice puts us at the risk of making the wrong choices. This is an experience that we often face; think, for example, about the choices we make in a restaurant when we are asked to choose between the dishes on the menu:

  • We can choose something that we already know and that, in the past, has given us back a known reward with gratification (exploitation), such as pizza (who does not know the goodness of a margherita pizza?)
  • We can try something new that we have never tasted before and see what we get (exploration), such as lasagna (alas, not everyone knows the magic taste of a plate of lasagna)

The choice we will make will depend on many boundary conditions: the price of the dishes, the level of hunger, knowledge of the dishes, and so on. What is important is that the study of the best way to make these kinds of choices has demonstrated that optimal learning sometimes requires us to make bad choices. This means that, sometimes, you have to choose to avoid the action you deem most rewarding and take an action that you feel is less rewarding. The logic is that these actions are necessary to obtain a long-term benefit: sometimes, you need to get your hands dirty to learn more.

The following are more examples of adopting this technique for real-life cases:

  • Selection of a store:
    • Exploitation: Go to your favorite store
    • Exploration: Try a new store
  • Choice of a route:
    • Exploitation: Choose the best route so far
    • Exploration: Try a new route

In practice, in very complex problems, convergence to a very good strategy would be too slow.

A good solution to the problem is to find a balance between exploration and exploitation:

  • An agent that limits itself to exploring will always act randomly in every state, and it is evident that convergence to an optimal strategy is impossible
  • An agent that explores too little will always use the usual actions, which may not be the optimal ones

Finally, we can say that at every step the agent has to choose between repeating what it has done so far, or trying out new movements that could achieve better results.
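
A common way to strike this balance is an ε-greedy rule: with a small probability ε the agent explores a random action, and otherwise it exploits the action with the best estimated reward. The following sketch uses the restaurant example; the reward estimates are invented numbers:

import random

def epsilon_greedy(estimates, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action.

    estimates maps each action to its currently estimated reward; in a real
    agent these estimates would be learned from experience.
    """
    if random.random() < epsilon:
        return random.choice(list(estimates))  # exploration: gather more information
    return max(estimates, key=estimates.get)   # exploitation: best current estimate

# Invented estimates for the restaurant example above.
dishes = {"pizza": 0.8, "lasagna": 0.5}
print(epsilon_greedy(dishes, epsilon=0.2))     # usually 'pizza', sometimes 'lasagna'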