Reinforcement learning

Course Notes for CS6700 (RL)
Updated: October 23, 2025

Preface

This is a hastily written version of the lecture notes used in the “CS6700: Reinforcement learning” course. The portion on the theory of MDPs roughly coincides with Chapter 1 of (Bertsekas 2017) and Chapters 2, 4, 5, and 6 of (Bertsekas and Tsitsiklis 1996). For several topics, (Sutton and Barto 1998) is a useful reference, in particular for building an intuitive understanding. Chapters 6 and 7 of (Bertsekas 2012) are also useful reference material for the advanced topics, such as RL with function approximation.

I would like to thank the students of the Jan-May 2021 batch of CS6700 for their help in typesetting a portion of these notes. Note that these notes still require a major editorial revision as well as a round of proofreading, so the reader should be wary of errors. As an alternative, the textbooks cited above are excellent source material for learning the foundations of RL.

A special thanks to Prof. Aditya Mahajan for providing the Quarto template.

About the course

Course Content

  • Markov Decision Processes (MDPs)
    • Finite horizon MDPs
      • General theory
      • DP algorithm
    • Infinite horizon model (1): Stochastic shortest path
      • General theory: Contraction mapping, Bellman equation
      • Computational solution schemes: Value and policy iteration, convergence analysis
    • Infinite horizon model (2): Discounted cost MDPs
      • General theory: Contraction mapping, Bellman equation
      • Classical solution techniques: Value and policy iteration (a value iteration sketch follows this list)
  • Reinforcement Learning
    • Stochastic approximation
      • Introduction and connection to RL
      • Convergence result for contraction mappings
    • Tabular methods
      • Monte Carlo policy evaluation
      • Temporal difference learning
        • TD(0), TD(λ)
        • Convergence analysis
      • Q-learning and its convergence analysis (sketched in code after this list)
    • Function approximation
      • Approximate policy evaluation using TD(λ)
      • Least-squares methods: LSTD and LSPI
    • Policy-gradient algorithms
      • Policy gradient theorem
      • Gradient estimation using likelihood ratios
      • Variants (REINFORCE, PPO, etc.; a REINFORCE sketch follows this list)
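
To give a taste of the classical solution techniques before the course starts, here is a minimal sketch of value iteration for a discounted-cost MDP, written in the cost-minimization convention used throughout these notes. The two-state, two-action model (transition probabilities P, cost table c, discount factor gamma) is made up purely for illustration.

```python
import numpy as np

# Hypothetical two-state, two-action discounted-cost MDP (illustrative numbers).
# P[a, s, s'] is the probability of moving from s to s' under action a.
P = np.array([
    [[0.9, 0.1],    # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],    # action 1
     [0.4, 0.6]],
])
c = np.array([[1.0, 2.0],    # c[s, a]: one-stage cost of action a in state s
              [0.5, 3.0]])
gamma = 0.9                  # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality operator:
    # (TV)(s) = min_a [ c(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ]
    Q = c + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # sup-norm stopping rule
        break
    V = V_new

print("V* ~", V, "; greedy policy:", Q.argmin(axis=1))
```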
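Tabular Q-learning can likewise be previewed in a few lines: it is a stochastic-approximation scheme that updates Q(s, a) toward a sampled Bellman target while exploring epsilon-greedily. The snippet below reuses the same made-up MDP (restated so it runs on its own); the step size, exploration rate, and run length are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical MDP as in the value iteration sketch above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.4, 0.6]]])
c = np.array([[1.0, 2.0], [0.5, 3.0]])
gamma = 0.9

def step(s, a):
    """Simulate one transition: sample the next state, return it and the cost."""
    return rng.choice(2, p=P[a, s]), c[s, a]

Q = np.zeros((2, 2))
alpha, eps = 0.1, 0.1        # constant step size and exploration rate
s = 0
for _ in range(50_000):
    # epsilon-greedy around the current greedy (cost-minimizing) action
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmin())
    s_next, cost = step(s, a)
    # stochastic-approximation update toward the sampled Bellman target
    Q[s, a] += alpha * (cost + gamma * Q[s_next].min() - Q[s, a])
    s = s_next

print("Q ~", Q)    # should be close to the optimal Q-factors of the MDP above
```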
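Finally, the likelihood-ratio idea behind the policy-gradient algorithms can be illustrated with REINFORCE on a hypothetical two-armed bandit with a softmax policy. Since the course works with costs, the sketch performs gradient descent on the expected cost; the mean costs, noise level, and step size are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

mean_cost = np.array([1.0, 0.3])    # per-arm mean costs (unknown to the learner)
theta = np.zeros(2)                 # softmax policy parameters

def pi(theta):
    """Softmax policy over the two arms."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

lr = 0.05
for _ in range(5_000):
    p = pi(theta)
    a = rng.choice(2, p=p)
    cost = mean_cost[a] + 0.1 * rng.standard_normal()   # noisy cost sample
    # likelihood-ratio (score-function) term: grad log pi(a) = e_a - pi
    grad_log = -p
    grad_log[a] += 1.0
    theta -= lr * cost * grad_log   # descent direction on the expected cost

print("learned policy ~", pi(theta))   # concentrates on the cheaper arm 1
```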

Reference books

  • D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017
  • D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012
  • D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996
  • R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2020

References

Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, Vol. II, 4th edition. Athena Scientific.
Bertsekas, D. P. 2017. Dynamic Programming and Optimal Control, Vol. I. Athena Scientific.
Bertsekas, D. P. and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific.
Sutton, R. S. and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.