CS6700: Reinforcement learning
Course information
- When: Feb-May 2021 
- Lectures: Slot J 
- Where: Online 
- Teaching Assistants: Nithia Vijayan, Dipayan Sen 
Course Content
Markov Decision Processes
- Finite horizon model
  - General theory
  - DP algorithm (a backward-induction sketch follows below)
 
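As a concrete illustration of the finite-horizon DP algorithm listed above, here is a minimal backward-induction sketch for a cost-minimizing MDP. The function name, the argument layout (per-action transition matrices and stage costs), and the assumption of stage-invariant costs are illustrative choices, not the course's official code.

```python
import numpy as np

def backward_induction(P, c, cN, N):
    """Finite-horizon DP (backward induction) for a cost-minimizing MDP.

    P[a] : (S, S) transition matrix under action a
    c[a] : (S,)  one-stage cost under action a (assumed stage-invariant here)
    cN   : (S,)  terminal cost
    N    : horizon length
    Returns the optimal cost-to-go J[k] and policy mu[k] for k = 0..N-1.
    """
    A = len(P)
    J = [None] * (N + 1)
    mu = [None] * N
    J[N] = np.asarray(cN, dtype=float)
    for k in range(N - 1, -1, -1):
        # Q[a, s] = c_a(s) + sum_{s'} P_a(s, s') * J_{k+1}(s')
        Q = np.array([c[a] + P[a] @ J[k + 1] for a in range(A)])
        J[k] = Q.min(axis=0)      # minimize over actions at each state
        mu[k] = Q.argmin(axis=0)  # greedy action at stage k
    return J, mu
```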
- Infinite horizon model: (1) Stochastic shortest path
  - General theory: Contraction mapping, Bellman equation
  - Computational solution schemes: Value and policy iteration, convergence analysis
 
- Infinite horizon model: (2) Discounted cost MDPs
  - General theory: Contraction mapping, Bellman equation
  - Classical solution techniques: value and policy iteration (a value-iteration sketch follows below)
 
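To illustrate the classical solution techniques for discounted-cost MDPs, here is a minimal value-iteration sketch: because the Bellman operator is a sup-norm contraction with modulus gamma, the iterates converge to the unique fixed point J*. The interface (per-action matrices P[a], costs c[a]) and the stopping rule are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, c, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration for a discounted-cost MDP (illustrative sketch).

    P[a] : (S, S) transition matrix under action a
    c[a] : (S,)  expected one-stage cost under action a
    Iterates J_{k+1} = T J_k, where (T J)(s) = min_a [c_a(s) + gamma * sum_{s'} P_a(s,s') J(s')].
    """
    A, S = len(P), len(c[0])
    J = np.zeros(S)
    for _ in range(max_iter):
        Q = np.array([c[a] + gamma * P[a] @ J for a in range(A)])
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:   # sup-norm stopping criterion
            J = J_new
            break
        J = J_new
    # greedy policy with respect to the final value estimate
    policy = np.array([c[a] + gamma * P[a] @ J for a in range(A)]).argmin(axis=0)
    return J, policy
```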
Reinforcement learning
- Stochastic iterative algorithms
  - Convergence result for contraction mappings
 
- Tabular methods
  - Monte Carlo policy evaluation
  - Temporal difference learning: TD(0), TD(lambda) (a TD(0) sketch follows after this list)
  - Convergence analysis
 
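Below is a minimal tabular TD(0) policy-evaluation sketch. It assumes a hypothetical environment interface (env.reset(), env.step(a) returning (next_state, reward, done), and an attribute env.num_states) and uses the reward-maximization convention common in the RL literature rather than the cost convention of the MDP part.

```python
import numpy as np

def td0_policy_evaluation(env, policy, gamma=0.95, alpha=0.05, episodes=500):
    """Tabular TD(0) evaluation of a fixed policy (sketch; env interface assumed).

    After each transition (s, a, r, s'), the estimate is nudged toward the
    one-step bootstrapped target: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    """
    V = np.zeros(env.num_states)           # assumed attribute
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                  # policy is a state -> action callable
            s_next, r, done = env.step(a)  # assumed interface
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```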
- Q-learning and its convergence analysis (a tabular Q-learning sketch follows below)
 
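A tabular Q-learning sketch with epsilon-greedy exploration, under the same hypothetical environment interface and reward convention as the TD(0) sketch above.

```python
import numpy as np

def q_learning(env, gamma=0.95, alpha=0.1, eps=0.1, episodes=2000, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))   # assumed attributes
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if rng.random() < eps:
                a = int(rng.integers(env.num_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)              # assumed interface
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])      # off-policy TD update
            s = s_next
    return Q
```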
- Function approximation
  - Approximate policy iteration and error bounds
  - Approximate policy evaluation using TD(lambda): the case of linear function approximation
  - Convergence analysis
  - Least-squares methods: LSTD and LSPI (an LSTD sketch follows after this list)
 
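A sketch of LSTD(0) with linear features V(s) ~ phi(s)^T theta, assuming the transitions were collected under the policy being evaluated. The data format, the feature map phi, and the small ridge term that keeps the matrix invertible are illustrative choices.

```python
import numpy as np

def lstd(transitions, phi, gamma=0.95, reg=1e-6):
    """LSTD(0) with linear function approximation (illustrative sketch).

    transitions : list of (s, r, s_next, done) tuples collected under the
                  policy being evaluated; phi maps a state to a feature vector.
    Solves A theta = b with
      A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s_{t+1}))^T,  b = sum_t phi(s_t) r_t.
    """
    d = len(phi(transitions[0][0]))
    A = reg * np.eye(d)          # small ridge term keeps A invertible
    b = np.zeros(d)
    for s, r, s_next, done in transitions:
        f = phi(s)
        f_next = np.zeros(d) if done else phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)  # weight vector theta
```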
- Q-learning with function approximation
 
- Tentative list of other topics:
  - Policy-gradient and actor-critic algorithms
    - Policy gradient theorem
    - Gradient estimation using likelihood ratios (a REINFORCE-style sketch follows after this list)
    - Actor-critic methods: Compatible features, essential features, advantage updating
 
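A sketch of likelihood-ratio gradient estimation in the REINFORCE style: the policy gradient is estimated by averaging sum_t grad log pi_theta(a_t | s_t) * G_t over sampled trajectories, where G_t is the discounted return from time t. The trajectory format and the grad_log_pi callback are illustrative assumptions; no baseline or actor-critic machinery is included.

```python
def reinforce_gradient(episodes, grad_log_pi, gamma=0.99):
    """Likelihood-ratio (REINFORCE) estimate of the policy gradient (sketch).

    episodes    : list of trajectories, each a list of (s, a, r) tuples
                  sampled under the current policy pi_theta
    grad_log_pi : callable (s, a) -> grad_theta log pi_theta(a | s), e.g. a numpy vector
    """
    grads = []
    for traj in episodes:
        rewards = [r for (_, _, r) in traj]
        returns = [0.0] * len(traj)
        G = 0.0
        for t in range(len(traj) - 1, -1, -1):   # returns-to-go, computed backwards
            G = rewards[t] + gamma * G
            returns[t] = G
        # score-function estimate for this trajectory
        g = sum(grad_log_pi(s, a) * G_t for (s, a, _), G_t in zip(traj, returns))
        grads.append(g)
    return sum(grads) / len(grads)               # average over sampled episodes
```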
- Regret minimization in MDPs
  - Introduction to bandits and the UCB algorithm (a UCB1 sketch follows after this list)
  - UCRL and PSRL algorithms
 
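A sketch of the UCB1 index policy for a stochastic multi-armed bandit with rewards in [0, 1]: each arm's empirical mean is inflated by an exploration bonus sqrt(2 log t / n_a). The arms-as-callables interface is an illustrative assumption.

```python
import numpy as np

def ucb1(arms, horizon):
    """UCB1 for a stochastic multi-armed bandit (illustrative sketch).

    arms : list of callables, each returning a reward sample in [0, 1]
    """
    K = len(arms)
    counts = np.zeros(K)
    means = np.zeros(K)
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            a = t - 1                                   # play each arm once first
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)   # exploration bonus
            a = int(np.argmax(means + bonus))
        r = arms[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]          # running mean update
        total_reward += r
    return means, counts, total_reward
```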
 
The portion on MDPs roughly coincides with Chapter 1 of Vol. I of the Dynamic Programming and Optimal Control book by Bertsekas and Chapters 2, 4, 5 and 6 of the Neuro-Dynamic Programming book by Bertsekas and Tsitsiklis. For several topics, the book by Sutton and Barto is a useful reference, in particular for obtaining an intuitive understanding. Also, Chapters 6 and 7 of DP/OC Vol. II are a useful reference for the advanced topics under RL with function approximation.
Honourable omissions: Neural networks, Average cost models.
The schedule of lectures from the 2019 run of this course is available here.
Grading
- Mid-term exam: 25% 
- Final exam: 25% 
- Assignments: 10% each 
- Project: 20% 
- Paper review: 5% 
- Scribe: 5% 
Important Dates
- Assignment 1: Mar 10, Assignment 2: Apr 20 
- Mid-term: Mar 24 
- Final: May 13 
- Project: May 5 (both report and presentation are due on this date) 
- Paper review: Apr 26 
- Scribing: May 10 
Paper review
Each student is expected to read a paper from a list (to be provided) and submit a one-page critical review that identifies the strengths of the paper and areas for improvement. A mere paraphrase of the paper's content is strongly discouraged.
Course project
- Students work in teams of two. 
- Each team is required to implement an RL algorithm of their choice. 
- A few MDP environments will be provided for tuning the RL algorithm. 
- The submitted RL algorithm will be evaluated across different MDP environments and assigned a score based on several performance indicators. 
- Each team is required to submit a short report describing the chosen algorithm and its associated parameters. 
Assignments facilitated by Aicrowd
- Assignment 1: Click here 
- Assignment 2: Click here 
- Project: Click here 
Textbooks/References
- D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017. 
- D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012. 
- D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996. 
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2020. 
