You can refer to the following articles:
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Ronald J. Williams, 1992
Policy Gradient Methods for Reinforcement Learning with Function Approximation, Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour, 1999
Playing Atari with Deep Reinforcement Learning, Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller, 2013
Mastering the Game of Go with Deep Neural Networks and Tree Search, David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel & Demis Hassabis, 2016
Asynchronous Methods for Deep Reinforcement Learning, Volodymyr Mnih, Adrià Puigdomènech...