We learned that reinforcement learning optimizes an agent's cumulative reward in an environment, and that the Markov decision process (MDP) is a mathematical framework for modeling sequential decisions using states, actions, and rewards. In this chapter, we saw that Q-learning finds the optimal action-selection policy for any MDP without requiring a transition model, whereas value iteration finds the optimal policy only when a transition model is given.
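To make the model-free idea concrete, here is a minimal sketch of tabular Q-learning on a hypothetical five-state chain MDP (the environment, state count, and hyperparameter values are illustrative assumptions, not the chapter's Gym code). The agent never consults the transition function directly; it only learns from sampled transitions.

```python
import random

# Hypothetical 5-state chain: start at state 0, reward 1 for reaching state 4.
N_STATES, ACTIONS = 5, [0, 1]          # action 0 = left, action 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

def step(state, action):
    """Deterministic transitions; reward only on reaching the goal state."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}  # the Q-table

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < EPSILON \
            else max(ACTIONS, key=lambda x: q[(s, x)])
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy (max) next action
        best_next = max(q[(s2, x)] for x in ACTIONS)
        q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])
        s = s2

greedy = [max(ACTIONS, key=lambda x: q[(s, x)]) for s in range(N_STATES)]
print(greedy)
```

Note that the update rule bootstraps from the best next action regardless of what the agent actually does next, which is what makes Q-learning an off-policy method.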
We also covered another important topic, the deep Q-network (DQN), a modified Q-learning approach that uses a deep neural network as a function approximator to generalize across different environments, unlike a Q-table, which is environment-specific. Finally, we learned to implement the Q-learning, deep Q-network, and SARSA algorithms in OpenAI Gym environments. Most of the implementations shown previously might work better with better hyperparameter...
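To contrast SARSA with the Q-learning shown above, here is a minimal on-policy sketch on the same kind of hypothetical five-state chain (again an illustrative toy, not the chapter's Gym implementation). The only substantive difference is the update target: SARSA bootstraps from the action the agent actually takes next, rather than from the greedy maximum.

```python
import random

N_STATES, ACTIONS = 5, [0, 1]          # action 0 = left, action 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

def step(state, action):
    """Deterministic transitions; reward only on reaching the goal state."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0), nxt == N_STATES - 1

def choose(q, s):
    """Epsilon-greedy behaviour policy (also used for the update target)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(s, a)])

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s, done = 0, False
    a = choose(q, s)
    while not done:
        s2, r, done = step(s, a)
        a2 = choose(q, s2)
        # SARSA update: bootstrap from the action actually taken next,
        # which makes the method on-policy (unlike Q-learning's greedy max)
        q[(s, a)] += ALPHA * (r + GAMMA * q[(s2, a2)] - q[(s, a)])
        s, a = s2, a2
```

Because the target follows the exploratory behaviour policy, SARSA tends to learn slightly more conservative values than Q-learning under the same epsilon.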