Every scientific and engineering field has its own assumptions and limitations. In the previous section, we discussed supervised learning, in which such assumptions are the knowledge of input-output pairs. No labels for your data? Sorry, you need to figure out how to obtain labels or try to use some other theory. It doesn't make supervised learning good or bad, it just makes it inapplicable to your problem. It's important to know and understand those play rules for various methods, as it can save you tons of time in advance. However, we know there are many examples of practical and theoretical breakthroughs, when somebody tried to challenge the rules in a creative way. To do this you should first of all know the limitations.
Of course, such formalisms exist for RL, and now it is the right time to introduce them, as we'll spend the rest of the book analyzing them from various angles. You can see the following diagram showing two major RL entities: Agent and Environment and their communication channels: Actions, Reward, and Observ ations:
The first thing to discuss is a notion of reward. In RL, it's just a scalar value we obtain periodically from the environment. It can be positive or negative, large or small, but it's just a number. The purpose of reward is to tell our agent how well they have behaved. We don't define how frequently the agent receives this reward; it can be every second or once in a lifetime, although it's common practice to receive a reward every fixed timestamp or every environment interaction, just for convenience. In the case of once-in-a-lifetime reward systems, all rewards except the last one will be zero.
As I mentioned, the purpose of a reward is to give an agent feedback about its success, and it's an important central thing in RL. Basically, the term reinforcement comes from the fact that a reward obtained by an agent should reinforce its behavior in a positive or negative way. Reward is local, meaning, it reflects the success of the agent's recent activity, not all the successes achieved by the agent so far. Of course, getting a large reward for some action doesn't mean that a second later you won't face dramatic consequences from your previous decisions. It's like robbing a bank: it could look like a good idea until you think about the consequences.
What an agent is trying to achieve is the largest accumulated reward over its sequence of actions. To give you a more intuitive understanding of reward, let's list some concrete examples with their rewards:
Financial trading: An amount of profit is a reward for a trader buying and selling stocks.
Chess: Here, reward is obtained at the end of the game, as a win, lose, or draw. Of course, it's up to interpretation. For me, for example, having a draw in a match against a chess master would be a huge reward. In practice, we need to explicitly specify the exact reward value, but it could be a fairly complicated expression. For instance, in case of chess, the reward could be proportional to the opponent's strength.
Dopamine system in a brain: There is a part in the brain (limbic system) that produces dopamine every time it needs to send a positive signal to the rest of the brain. Higher concentrations of dopamine lead to a sense of pleasure, which reinforces activities considered by this system as good. Unfortunately, the limbic system is ancient in terms of things it considers good: food, reproduction, and dominance, but this is a totally different story.
Computer games: They usually give obvious feedback to the player, which is either the number of enemies killed or a score gathered. Note in this example that reward is already accumulated, so the RL reward for arcade games should be the derivative of the score, that is, +1 every time a new enemy is killed and 0 at all other time steps.
Web navigation: There is a set of problems with high practical value, which is to be able to automatically extract information present on the web. Search engines are trying to solve this task in general, but sometimes, to get to the data you're looking for you need to fill some forms or navigate through series of links, or complete captchas, which can be difficult for search engines to do. There is an RL-based approach to those tasks, in which the reward is the information or the outcome you need to get.
Neural network architecture search: RL has been successfully applied to the domain of NN architecture optimization, where the aim is to get the best performance metric on some dataset by tweaking the number of layers or their parameters, adding extra bypass connections, or making other changes to the neural network architecture. The reward in this case is the performance (accuracy or another measure showing how accurate the NN predictions are).
Dog training: If you have ever tried to train a dog, you know that you need to give it something tasty (but too not much) every time it does the thing you've asked. It's also common to punish your pet a bit (negative reward) when it doesn't follow your orders, although recent studies have shown this isn't as effective as positive rewards.
School marks: We all have experience here! School marks are a reward system to give pupils feedback about their studying.
As you can see from the preceding examples, the notion of reward is a very general indication of the agent's performance, and it can be found or artificially injected into lots of practical problems around us.
An agent is somebody or something who/which interacts with the environment by executing certain actions, taking observations, and receiving eventual rewards for this. In most practical RL scenarios, it's our piece of software that is supposed to solve some problem in a more-or-less efficient way. For our initial set of six examples, the agents will be one of these:
Financial trading: A trading system or a trader making decisions about order execution
Chess: A player or a computer program
Dopamine system: The brain itself, according to sensory data, decides if it was a good experience or bad
Computer games: The player who enjoys the game or the computer program (Andrey Karpathy once stated in his tweet, "We were supposed to make AI do all the work and we play games but we do all the work and the AI is playing games!")
Web navigation: The software that tells the browser which links to click on, where to move the mouse, or which text to enter
Dog training: Your beloved pet
The environment is everything outside of an agent. In the most general sense, it's the rest of the universe, but this goes slightly overboard and exceeds the capacity of even tomorrow's computers, so we usually follow the general sense here.
The environment is external to an agent, and its communication with the environment is limited by rewards (obtained from the environment), actions (executed by the agent and given to the environment), and observations (some information besides the rewards that the agent receives from the environment). We discussed rewards already, so let's talk about actions and observations.
Actions are things that an agent can do in the environment. Actions can be moves allowed by the rules of play (if it's some game), or it can be doing homework (in the case of school). They can be simple such as move pawn one space forward, or complicated such as fill the tax form in for tomorrow morning.
In RL, we distinguish between two types of actions: discrete or continuous. Discrete actions form the finite set of mutually exclusive things an agent could do, such as move left or right. Continuous actions have some value attached to the action, such as a car's action steer the wheel having an angle and direction of steering. Different angles could lead to a different scenario a second later, so just saying steer the wheel is definitely not enough.
Observations of the environment is the second information channel for an agent, with the first being a reward. You may be wondering, why do we need a separate data source? The answer is convenience. Observations are pieces of information that the environment provides the agent with, which say what's going on around them. It may be relevant to the upcoming reward (such as seeing a bank notification saying, You have been paid) or not. Observations even can include reward information in some vague or obfuscated form, such as score numbers on a computer game's screen. Score numbers are just pixels, but potentially we can convert them into reward values; it's not a big deal with modern deep learning at hand.
On the other hand, reward shouldn't be seen as a secondary or unimportant thing: the reward is the main force that drives the agent's learning process. If the reward is made wrong, noisy, or just slightly off-course of the primary objective, then there is a chance that training will go in a wrong way.
It's also important to distinguish between an environment's state and observations. The state of an environment potentially includes every atom in the universe, which makes it impossible to measure everything about the environment. Even if we limit the environment's state to be small enough, most of the time it's either still not possible to get full information or our measurements will contain noise. This is completely fine though, and RL was created to support such cases natively. Once again, let's support our intuition with our set of examples to capture the difference:
Financial trading: Here the environment is the whole financial market and everything that influences it. This is a huge list of things such as the latest news, economic and political conditions, weather, food supplies, and Twitter trends. Even your decision to stay home today can potentially indirectly influence the world financial system. However, our observations are limited to stock prices, news, and so on. We don't have access to most of the environment's state, which makes trading such a nontrivial thing.
Chess: The environment here is your board plus your opponent, which includes their chess skills, mood, brain state, chosen tactics, and so on. Observation is what you see (your current chess position), but, I guess, at some levels of play mastery, the knowledge of psychology and ability to read an opponent's mood could increase your chances.
Dopamine system: The environment here is your brain PLUS nervous system and organ's states PLUS the whole world you can perceive. Observations are the inner brain state and signals coming from your senses.
Computer game: Here, the environment is your computer's state, including all memory and disk data. For networked games, you need to include other computers PLUS all internet infrastructure between them and your machine. Observations are a screen's pixels and sound, that's it. A screen's pixels is not a tiny amount of information (somebody calculated that the total number of possible moderate-size images 1024 × 768 is significantly larger than the number of atoms in our galaxy), but the whole environment state is definitely larger.
Web navigation: The environment here is the internet, including all the network infrastructure between the computer our agent works and the web server, which is a really huge system that includes millions and millions of different components. Observation is normally the web page that is loaded at the current navigation step.
Neural network architecture search: In this example, the environment is fairly simple and includes the NN toolkit that performs the particular neural network evaluation and the dataset that is used to obtain the performance metric. In comparison to the internet, this looks like a tiny toy environment. Observations might be different and include some information about the testing, such as loss convergence dynamics or other metrics obtained from the evaluation step.
Dog training: Here the environment is your dog (including its hardly observable inner reactions, mood, and life experiences) and everything around it, including other dogs and a cat hiding in a bush. Observations are signals from your senses and memory.
School: The environment here is the school itself, the education system of the country, society, and the cultural legacy. Observations are the same as for the dog training: the student's senses and memory.
This is our mise en scène and we'll play around with it in the rest of the book. I think you've already noticed that the RL model is extremely flexible, general, and could be applied to a variety of scenarios. Let's look at how RL is related to other disciplines, before diving into the details of RL's model.
There are many other areas that contribute or relate to RL. The most significant are shown in the following diagram (taken from David Silver's RL course http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html), which includes six large domains heavily overlapping each other on the methods and specific topics related to decision making (shown inside the inner gray circle). In the intersection of all those related, but still different scientific areas, sits RL, which is so general and flexible that it can take the best from these varying domains:
Machine learning (ML): RL, being a subfield of ML, borrows lots of its machinery, tricks, and techniques from ML. Basically, the goal of RL is to learn how an agent should behave when it is given imperfect observational data.
Engineering (especially optimal control): This helps in taking a sequence of optimal actions to get the best result.
Neuroscience: We saw the dopamine system as our example, and it has been shown that the human brain acts closely to the RL model.
Psychology: This studies behavior in various conditions, such as how people react and adapt, which is close to the RL topic.
Economics: One of the important topics is how to maximize reward in terms of imperfect knowledge and the changing conditions of the real world.
Mathematics: This works with idealized systems, and also devotes significant attention to finding and reaching the optimal conditions in the field of operations research.