CS6700: Reinforcement learning

Course information

  • When: Jul-Nov 2018

  • Lectures: Slot J

  • Where: CS26

  • Teaching Assistants: Ajay Kumar Pandey, Nirav Bhavsar, Nithia V

  • Office Hours: Wednesday, 3:30pm to 4:30pm

Course Content

Markov Decision Processes

  • Finite horizon model

    • General theory

    • DP algorithm

  • Infinite horizon model: (1) Stochastic shortest path

    • General theory: Contraction mapping, Bellman equation

    • Computational solution schemes: Value and policy iteration, convergence analysis

  • Infinite horizon model: (2) Discounted cost MDPs

    • General theory: Contraction mapping, Bellman equation

    • Classical solution techniques: value and policy iteration

Reinforcement learning

  • Stochastic iterative algorithm

    • Convergence result for contraction mappings

  • Tabular methods

    • Monte Carlo policy evaluation

    • Temporal difference learning

      • TD(0), TD(lambda)

      • Convergence analysis

    • Q-learning and its convergence analysis

  • Function approximation

    • Approximate policy iteration and error bounds

    • Approximate policy evaluation using TD(lambda)

      • The case of linear function approximation

      • Convergence analysis

      • Least-squares methods: LSTD and LSPI

    • Q-learning

  • Tentative list of other topics:

    • Policy-gradient and actor-critic algorithms

      • Policy gradient theorem

      • Gradient estimation using likelihood ratios

      • Actor-critic methods: Compatible features, essential features, advantage updating

    • Regret minimization in MDPs

      • Introduction to bandits and UCB algorithm

      • UCRL and PSRL algorithms

The portion on MDPs roughly coincides with Chapters 1 of Vol. I of Dynamic programming and optimal control book of Bertsekas and Chapter 2, 4, 5 and 6 of Neuro dynamic programming book of Bertsekas and Tsitsiklis. For several topics, the book by Sutton and Barto is an useful reference, in particular, to obtain an intuitive understanding. Also, Chapter 6 and 7 of DP/OC Vol II is a useful reference of the advanced topics under RL with function approximation.

Honourable omissions: Neural networks, Average cost models.


  • Project: 35%

  • Homeworks: 30% (Best 3 out of 4)

  • Quizzes: 30% (15% for each quiz)

  • Paper summary: 5%

Important Dates

Quiz 1 Sep 8
Quiz 2 Oct 7
Available on Submission by
Homework 1 Aug 20 Sep 2
Homework 2 Sep 10 Sep 20
Homework 3 Sep 22 Oct 2
Homework 4 Oct 7 Oct 17
Paper summary Oct 30
Project proposal Oct 30
Project report Nov 29

Paper review

Each individual student is expected to read a paper from a list to be provided soon and submit a 1-page critical review that identifies the strengths and areas of improvement of the paper studied. A paraphrase of the paper content is strongly discouraged. A list of interesting (albeit non-exhaustive) RL papers will be made available soon.

Course project

  • The project could be theoretical and/or practical.

  • Theoretical projects could involve a novel problem formulation or algorithmic adaptation or extension of existing theory of RL algorithms.

  • Practical projects could involve implementation of existing or new RL algorithms on a new benchmark.

  • Given the limited timeframe of the course, working out a new result in theory may not be completely feasible and in that case, the project could be converted into an empirical investigation of the algorithms that the students familiarized with, while attempting to formulate/solve a RL problem.

  • The report is expected to provide a detailed description of the problem, accompanying algorithms and novel theoretical results or findings from an empirical investigation.

The evaluation would be as follows: 5% for the project proposal, 25% for the final report and 5% for a 15-minute presentation of the work done.


  • D.P.Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017

  • D.P.Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012

  • D.P.Bertsekas and J.N.Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996

  • R.S.Sutton and A.G.Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. For an updated draft, Click here

Quizzes and homeworks


Theory of Markov decision processes
Lecture 1 Finite horizon MDPs: a few examples 1.1
Lecture 2* Finite horizon MDP framework, Open vs. closed loop policies 1.2,1.3
Lecture 3* Principle of optimality and the DP algorithm 1.4
Lecture 4* Proof of optimality of the DP algorithm 1.4
Lecture 5 Two examples: linear system with quadratic cost and asset selling 1.4
Lecture 6* Infinite horizon MDPs: an introduction and preview of the main results 2.1
Lecture 7* Stochastic shortest path (SSP) formulation 2.2
Lecture 8 Proper policies, Bellman optimality operator, Fixed point equation for policy evaluation 2.2
Lecture 9* Bellman's equation and optimality conditions 2.2
Lecture 10 Bellman's equation and optimality conditions (contd) 2.2
Lecture 11* SSP examples
Lecture 12 HW1 discussion
Lecture 13 Value iteration
Lecture 14* Policy iteration, introduction to contraction mappings
Lecture 15* Contraction mappings in an SSP
Lecture 16* Discounted cost MDP formulation
Lecture 17 Bellman equation, optimality conditions
Lecture 18* Underlying contraction properties, value iteration, policy iteration
Lecture 19* Linear programming approach, Q-learning
Reinforcement learning
Lecture 1 Introduction to reinforcement learning problem, connection to stochastic approximation
Lecture 2* First and second-order optimality conditions, Gradient descent algorithms
Lecture 3* Probability recap: introduction to sigma fields
Lecture 4* Probability recap: conditional distributions and expectation, martingale differences
Lecture 5 Stochastic iterative algorithms: Convergence in the Lyapunov case
Lecture 6* Stochastic iterative algorithms: Convergence in the sup-norm pseudo contraction mapping case
Lecture 7* Monte Carlo policy evaluation
Lecture 8 Temporal difference learning
Lecture 9* TD(0)
Lecture 10* TD(lambda)
Lecture 11 Offline vs online TD
Lecture 12* Convergence analysis of TD(0)
Lecture 13* TD(lambda): Underlying contraction property, Introduction to Q-learning
Lecture 14 Q-learning: convergence analysis, need for exploration
Lecture 15* Approximate policy evaluation using projected equations
Lecture 16* Brief introduction to Markov chains, Contraction properties for projected equation methods
Lecture 17* Projected equation methods: LSTD and TD(0) variants
Lecture 18 Q-learning with function approximation
Lecture 19* Introduction to policy gradient methods
Lecture 20* REINFORCE algorithm, Overview of actor-critic algorithms

{}{raw} <p style=“font-size:0.6em;color:grey”> <i> The lectures marked * are 75 minutes long, while the rest are of 50 minutes duration. <i> <p>