CS6700: Reinforcement learning
Course information
When: Jul-Nov 2021
Lectures: Slot J
Where: Online
Teaching Assistants: Nithia Vijayan, Mizhaan Maniyar, Akash Reddy, Rebin Silva, Sandip Saha, and Nency Bansal
Course Content
Markov Decision Processes (MDPs)
Finite horizon MDPs: General theory, DP algorithm
Infinite horizon model: (1) Stochastic shortest path
General theory: Contraction mapping, Bellman equation
Computational solution schemes: Value and policy iteration, convergence analysis
Infinite horizon model: (2) Discounted cost MDPs
General theory: Contraction mapping, Bellman equation
Classical solution techniques: value and policy iteration
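The value iteration scheme listed above can be sketched in a few lines. This is a minimal NumPy illustration, not the course's reference implementation; the representation of the MDP via per-action transition matrices `P[a]` and expected one-step cost vectors `R[a]` is an assumption made for the example, and the cost-minimization convention follows the Bertsekas texts used in the course.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a discounted-cost MDP (illustrative sketch).

    P[a][s, s'] : transition probability from s to s' under action a.
    R[a][s]     : expected one-step cost of action a in state s.
    Returns the (approximate) optimal value vector and a greedy policy.
    """
    n_states = P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality update:
        # V(s) <- min_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.min(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=0)  # value and greedy policy
        V = V_new
```

The contraction property of the Bellman operator (covered in the general theory above) is what guarantees this loop converges geometrically at rate gamma.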
Reinforcement learning
Stochastic iterative algorithm
Convergence result for contraction mappings
Tabular methods
Monte Carlo policy evaluation
Temporal difference learning
TD(0), TD(lambda)
Convergence analysis
Q-learning and its convergence analysis
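The tabular Q-learning algorithm above is an instance of the stochastic iterative schemes studied earlier in the course. The following is a minimal sketch in the cost-minimization convention; the `env_step(state, action, rng) -> (next_state, cost, done)` interface and the epsilon-greedy exploration parameters are assumptions made for the example.

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=500, gamma=0.9,
               alpha=0.1, eps=0.1, horizon=100, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (cost minimization).

    env_step(s, a, rng) is assumed to return (next_state, cost, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0  # assume episodes start in state 0
        for _ in range(horizon):
            # epsilon-greedy action selection (greedy = min cost-to-go)
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmin())
            s_next, cost, done = env_step(s, a, rng)
            # stochastic approximation update toward the Bellman target
            target = cost + (0.0 if done else gamma * Q[s_next].min())
            Q[s, a] += alpha * (target - Q[s, a])
            if done:
                break
            s = s_next
    return Q
```

With a suitably decaying step size alpha and sufficient exploration, the convergence analysis covered in lecture shows Q converges to the optimal Q-factors; the constant step size here is chosen only for simplicity.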
Function approximation
Approximate policy evaluation using TD(lambda)
Least-squares methods: LSTD and LSPI
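The LSTD method listed above solves for the TD fixed point in one linear solve rather than by stochastic iteration. A minimal LSTD(0) sketch follows; the batch-of-transitions input and the feature-map name `phi` are assumptions made for the example.

```python
import numpy as np

def lstd(transitions, phi, d, gamma=0.9):
    """LSTD(0): least-squares estimate of theta with V(s) ~ phi(s)^T theta.

    transitions : iterable of (state, cost, next_state) samples.
    phi         : feature map, state -> R^d.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, c, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)  # sample estimate of E[phi (phi - gamma phi')^T]
        b += c * f                            # sample estimate of E[phi * cost]
    return np.linalg.solve(A, b)
```

LSPI then wraps such a policy-evaluation step inside a policy-iteration loop, re-fitting the weights for each successive greedy policy.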
Introduction to policy-gradient algorithms
Policy gradient theorem
Gradient estimation using likelihood ratios
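The likelihood-ratio (score-function) estimator above can be sketched for a tabular softmax policy, where the policy gradient theorem gives an unbiased estimate of the form G_t * grad log pi(a_t|s_t). This is a minimal illustration; the `(state, action, return-from-t)` trajectory format and the per-(state, action) parameterization are assumptions made for the example.

```python
import numpy as np

def reinforce_gradient(trajectory, theta):
    """Likelihood-ratio (REINFORCE-style) gradient estimate for a tabular
    softmax policy pi(a|s; theta) with theta of shape (n_states, n_actions).

    trajectory : list of (state, action, return_from_t) triples.
    """
    grad = np.zeros_like(theta)
    for s, a, G in trajectory:
        # softmax probabilities (shifted by max for numerical stability)
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        # score function: d/dtheta log pi(a|s) = one_hot(a) - probs
        score = -probs
        score[a] += 1.0
        grad[s] += G * score
    return grad
```

Ascending (for rewards) or descending (for costs) along this estimate, with a suitable step size, gives the basic policy-gradient iteration.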

The portion on MDPs roughly coincides with Chapter 1 of Vol. I of the Dynamic Programming and Optimal Control book by Bertsekas, and with Chapters 2, 4, 5, and 6 of the Neuro-Dynamic Programming book by Bertsekas and Tsitsiklis. For several topics, the book by Sutton and Barto is a useful reference, in particular for obtaining an intuitive understanding. Also, Chapters 6 and 7 of DP/OC Vol. II are a useful reference for the advanced topics under RL with function approximation.
Honourable omissions: Neural networks, Average cost models.
The schedule of lectures from the 2019 run of this course is available here.
Grading
Midterm exam: 25%
Final exam: 25%
Assignment 1: 10%
Assignment 2: 15%
Project: 20%
Paper review: 5%
Important Dates
Assignment 1: Sep 17
Assignment 2: Oct 25
Midterm: Sep 27
Final: Nov 18
Project: Nov 5
Paper review: Oct 28
Paper review
Students form teams of two, and each team reads a paper on RL. Each team submits a one-page critical review that identifies the strengths and areas for improvement of the paper studied. A mere paraphrase of the paper's content is strongly discouraged.
Course project
Students work in teams of two.
Each team is required to implement an RL algorithm of their choice.
A few MDP environments will be provided for tuning the RL algorithm.
The submitted RL algorithm will be evaluated across different MDP environments and assigned a score based on several performance indicators.
Each team is required to submit a short report describing the chosen algorithm and its associated parameters.
Textbooks/References
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017.
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012.
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.