CS6700: Reinforcement learning

Course information

Course Content

Markov Decision Processes

  • Finite horizon model

    • General theory

    • DP algorithm

  • Infinite horizon model: (1) Stochastic shortest path

    • General theory: Contraction mapping, Bellman equation

    • Computational solution schemes: Value and policy iteration, convergence analysis

  • Infinite horizon model: (2) Discounted cost MDPs

    • General theory: Contraction mapping, Bellman equation

    • Classical solution techniques: value and policy iteration

Reinforcement learning

  • Stochastic iterative algorithm

    • Convergence result for contraction mappings

  • Tabular methods

    • Monte Carlo policy evaluation

    • Temporal difference learning

      • TD(0), TD(lambda)

      • Convergence analysis

    • Q-learning and its convergence analysis

  • Function approximation

    • Approximate policy iteration and error bounds

    • Approximate policy evaluation using TD(lambda)

      • The case of linear function approximation

      • Convergence analysis

      • Least-squares methods: LSTD and LSPI

    • Q-learning

  • Tentative list of other topics:

    • Policy-gradient and actor-critic algorithms

      • Policy gradient theorem

      • Gradient estimation using likelihood ratios

      • Actor-critic methods: Compatible features, essential features, advantage updating

    • Regret minimization in MDPs

      • Introduction to bandits and UCB algorithm

      • UCRL and PSRL algorithms

The portion on MDPs roughly coincides with Chapters 1 of Vol. I of Dynamic programming and optimal control book of Bertsekas and Chapter 2, 4, 5 and 6 of Neuro dynamic programming book of Bertsekas and Tsitsiklis. For several topics, the book by Sutton and Barto is an useful reference, in particular, to obtain an intuitive understanding. Also, Chapter 6 and 7 of DP/OC Vol II is a useful reference of the advanced topics under RL with function approximation.

Honourable omissions: Neural networks, Average cost models.

The schedule of lectures from the 2019 run of this course is available here


  • Mid-term exam: 25%

  • Final exam: 25%

  • Assignments: 10% each

  • Project: 20%

  • Paper review: 5%

  • Scribe: 5%

Important Dates

  • Assignment 1: Mar 10, Assignment 2: Apr 20

  • Mid-term: Mar 24

  • Final: May 13

  • Project: May 5 (both report and presentation are due on this date)

  • Paper review: Apr 26

  • Scribing: May 10

Paper review

Each individual student is expected to read a paper from a list to be provided soon and submit a 1-page critical review that identifies the strengths and areas of improvement of the paper studied. A paraphrase of the paper content is strongly discouraged.

Course project

  • The students work in teams of size two.

  • Each team is required to implement an RL algorithm of their choice.

  • A few MDP environments would be provided for tuning the RL algorithm.

  • The submitted RL algorithm would be evaluated across different MDP environments, and assigned a score based on different performance indicators.

  • Each team is required to submit a short report providing a description of the chosen algorithm, and its associated parameters.

Assignments facilitated by Aicrowd


  • D.P.Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017

  • D.P.Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012

  • D.P.Bertsekas and J.N.Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996

  • R.S.Sutton and A.G.Barto, Reinforcement Learning: An Introduction, MIT Press, 2020.