The book I spent my Christmas holidays with was Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. The authors are considered the founding fathers of the field, and the book is an oft-cited textbook and part of the basic reading list for AI researchers. Given my own interest and fledgling attempts in the area (I trained my first models in 2017), I thought it worthwhile to spend some time learning the basics.
Reinforcement learning is one of the hottest fields in machine learning. But what does it mean specifically? Basically, it is learning what to do - how to map situations to actions - so as to maximize a numerical reward signal. The computer is not told which actions to take, as in most other forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.
The truth is you don’t have to read books like this to do some very basic AI work in 2018. If you have coding experience and some grasp of statistics and logic, you can go straight to YouTube videos and freely available courses with open-source code examples. You can use TensorFlow on your home computer, or the cloud offerings from Google, Amazon or Microsoft, to do the training, and so on.
Another important thing to understand is you won’t learn programming or machine learning just by reading books. You’ve got to get your hands dirty. You have to do actual coding. That’s how human learning works ;).
But reading this book will certainly help. The book does require some grasp of math, logic, statistics, set theory and probability, but you can pick these up along the way.
The best way to approach reinforcement learning is to first understand the problem it tries to solve and only then study the algorithms which attempt to solve it in one way or another. The authors explain that the reinforcement learning agent and its environment interact over a sequence of discrete time steps. The specification of their interface defines a particular task: the actions are the choices made by the agent; the states are the basis for making the choices; and the rewards are the basis for evaluating the choices. Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known. A policy is a stochastic rule by which the agent selects actions as a function of states. The agent's objective is to maximize the amount of reward it receives over time.
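To make that interface concrete, here is a minimal sketch of the agent-environment loop in Python. The env object with reset() and step(), and the random_policy function, are my own illustrative assumptions (in the style popularized by toolkits such as OpenAI Gym), not code from the book.

    import random

    def run_episode(env, policy, max_steps=1000):
        # The agent and a hypothetical environment interact over discrete time steps.
        state = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            action = policy(state)                    # action: the agent's choice, based on the state
            state, reward, done = env.step(action)    # the environment returns a new state and a reward
            total_reward += reward                    # rewards are the basis for evaluating the choices
            if done:
                break
        return total_reward

    def random_policy(state):
        # A trivially stochastic policy: pick an action at random, whatever the state.
        return random.choice(["left", "right"])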
The return is the function of future rewards that the agent seeks to maximize. It has several different definitions depending on whether one is interested in the total reward or the discounted reward. The first is appropriate for episodic tasks, in which the agent-environment interaction breaks naturally into episodes; the second is appropriate for continuing tasks, in which the interaction does not naturally break into episodes but continues without limit.
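To make the two definitions concrete, here is a tiny sketch (with variable names of my own choosing) that computes the total return for an episodic task and the discounted return used for continuing tasks:

    def total_return(rewards):
        # Episodic tasks: the return is simply the sum of rewards until the episode ends.
        return sum(rewards)

    def discounted_return(rewards, gamma=0.9):
        # Continuing tasks: a reward k steps in the future is weighted by gamma**k,
        # which keeps the (potentially infinite) sum finite for gamma < 1.
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(total_return([1, 0, 2]))       # 3
    print(discounted_return([1, 0, 2]))  # 1 + 0.9*0 + 0.81*2 = 2.62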
An environment satisfies the Markov property if its state compactly summarizes the past without degrading the ability to predict the future. This is rarely exactly true, but often nearly so; the state signal should be chosen or constructed so that the Markov property approximately holds. If the Markov property does hold, then the environment is called a Markov decision process (MDP). A finite MDP is an MDP with finite state and action sets. Most of the current theory of reinforcement learning is restricted to finite MDPs, but the methods and ideas apply more generally. A policy's value function assigns to each state the expected return from that state, given that the agent uses the policy. The optimal value function assigns to each state the largest expected return achievable by any policy, the authors write.
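For a flavour of what computing an optimal value function looks like when the dynamics of a finite MDP are fully known, here is a rough value-iteration sketch; the two-state MDP below is entirely made up for illustration and is not an example from the book.

    # mdp[state][action] is a list of (probability, next_state, reward) outcomes.
    mdp = {
        "s0": {"a": [(1.0, "s1", 0.0)], "b": [(1.0, "s0", 1.0)]},
        "s1": {"a": [(1.0, "s0", 5.0)], "b": [(1.0, "s1", 0.0)]},
    }
    gamma = 0.9
    V = {s: 0.0 for s in mdp}

    for _ in range(100):  # sweep repeatedly until the values settle
        V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            for s, actions in mdp.items()
        }

    print(V)  # approximates the optimal value function for this toy MDP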
After dealing with the reinforcement learning problem and some history of the field in Part I, Sutton and Barto analyze a range of methods for solving reinforcement learning tasks. You will read about dynamic programming, Monte Carlo methods and temporal-difference learning (an approach to which Sutton himself has contributed a lot).
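As a taste of temporal-difference learning, here is a sketch of the tabular TD(0) update for estimating a state-value function from a single observed transition; the function and argument names are mine, not the book's.

    from collections import defaultdict

    V = defaultdict(float)  # state-value estimates, initialized to zero

    def td0_update(state, reward, next_state, alpha=0.1, gamma=0.9):
        # Nudge V(state) toward the bootstrapped target: reward + gamma * V(next_state).
        target = reward + gamma * V[next_state]
        V[state] += alpha * (target - V[state])

    # One hypothetical transition observed while following some policy:
    td0_update("s0", reward=1.0, next_state="s1")
    print(V["s0"])  # 0.1, since all values started at zero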
All of the reinforcement learning methods the authors explore in this book have three key ideas in common. First, the objective of all of them is the estimation of value functions. Second, all operate by backing up values along actual or possible state trajectories. Third, all follow the general strategy of generalized policy iteration (GPI), meaning that they maintain an approximate value function and an approximate policy, and they continually try to improve each on the basis of the other. An interesting insight is that these approaches can be combined quite effectively.
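To see what generalized policy iteration can look like in its simplest tabular form, here is a sketch of Q-learning with an epsilon-greedy policy: the policy is derived from the current action-value estimates Q, and the experience it generates is used to improve Q in turn. The env and actions objects are the same kind of hypothetical interface as in the earlier sketches, and terminal-state handling is glossed over for brevity.

    import random
    from collections import defaultdict

    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    def epsilon_greedy(state, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)                 # explore
        return max(actions, key=lambda a: Q[(state, a)])  # exploit current estimates

    def q_learning_episode(env, actions, alpha=0.1, gamma=0.9):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state, actions)       # policy derived from Q
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            # Improve the value estimates using the experience the policy generated.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state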
Part III offers a number of case studies where reinforcement learning was applied.
Although the book would have benefited greatly from an analysis of the deep reinforcement learning techniques that have yielded such impressive results over the past few years, it remains a great source to learn from.