CS6700: Reinforcement learning
Course information
When: Jul-Nov 2018
Lectures: Slot J
Where: CS26
Teaching Assistants: Ajay Kumar Pandey, Nirav Bhavsar, Nithia V
Office Hours: Wednesday, 3:30pm to 4:30pm
Course Content
- Markov Decision Processes
  - Finite horizon model
    - General theory
    - DP algorithm
  - Infinite horizon model: (1) Stochastic shortest path
    - General theory: Contraction mapping, Bellman equation
    - Computational solution schemes: Value and policy iteration, convergence analysis
  - Infinite horizon model: (2) Discounted cost MDPs
    - General theory: Contraction mapping, Bellman equation
    - Classical solution techniques: value and policy iteration
- Reinforcement learning
  - Stochastic iterative algorithm
    - Convergence result for contraction mappings
  - Tabular methods
    - Monte Carlo policy evaluation
    - Temporal difference learning
      - TD(0), TD(lambda)
      - Convergence analysis
    - Q-learning and its convergence analysis (a small illustrative sketch of value iteration and Q-learning follows this list)
  - Function approximation
    - Approximate policy iteration and error bounds
    - Approximate policy evaluation using TD(lambda)
      - The case of linear function approximation
      - Convergence analysis
    - Least-squares methods: LSTD and LSPI
    - Q-learning
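To make the tabular methods above concrete, here is a minimal, self-contained sketch of value iteration and tabular Q-learning on a small discounted-cost MDP, following the cost-minimization convention of the course references. The model (`P`, `g`), the discount factor `alpha` and all other constants are made-up assumptions for illustration, not course material.

```python
# Illustrative sketch only: value iteration and tabular Q-learning on a small,
# randomly generated discounted-cost MDP.
import numpy as np

n_states, n_actions, alpha = 3, 2, 0.9          # alpha: discount factor (assumed)
rng = np.random.default_rng(0)

# P[a, s, s'] : probability of moving from s to s' under action a
# g[s, a]     : expected one-stage cost of taking action a in state s
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
g = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# Value iteration: repeatedly apply the Bellman optimality operator
#   (T J)(s) = min_a [ g(s,a) + alpha * sum_{s'} P(s'|s,a) J(s') ],
# a sup-norm contraction with modulus alpha, so the iterates converge to J*.
J = np.zeros(n_states)
for _ in range(2000):
    Q = g + alpha * np.einsum('asb,b->sa', P, J)   # Q[s, a]
    J_next = Q.min(axis=1)
    if np.max(np.abs(J_next - J)) < 1e-12:
        J = J_next
        break
    J = J_next

# Tabular Q-learning: stochastic approximation of the same fixed point from
# simulated transitions, with epsilon-greedy exploration and step size 1/N(s,a).
Qhat = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))
s = 0
for _ in range(200_000):
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Qhat[s].argmin())
    s_next = rng.choice(n_states, p=P[a, s])
    visits[s, a] += 1
    target = g[s, a] + alpha * Qhat[s_next].min()
    Qhat[s, a] += (target - Qhat[s, a]) / visits[s, a]
    s = s_next

print("J* from value iteration:", np.round(J, 3))
print("J* from Q-learning     :", np.round(Qhat.min(axis=1), 3))
```

The contraction property of the Bellman operator is what guarantees convergence of the value-iteration loop; the Q-learning loop converges to the same quantity under the standard step-size and exploration conditions discussed in the course.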
Tentative list of other topics:
- Policy-gradient and actor-critic algorithms
  - Policy gradient theorem (stated briefly after this list)
  - Gradient estimation using likelihood ratios
  - Actor-critic methods: Compatible features, essential features, advantage updating
- Regret minimization in MDPs
  - Introduction to bandits and UCB algorithm
  - UCRL and PSRL algorithms
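As a pointer for these tentative topics, here are brief statements of two standard results, written in the usual textbook conventions (reward maximization for the policy gradient, stochastic bandits for UCB); the notation is generic and not tied to the course notes.

```latex
% Policy gradient theorem (likelihood-ratio form): for a differentiable
% parameterized policy \pi_\theta with performance measure J(\theta),
\[
  \nabla_\theta J(\theta)
  \;=\; \mathbb{E}_{\pi_\theta}\!\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \big].
\]
% REINFORCE replaces Q^{\pi_\theta}(s,a) with a sampled return; actor-critic
% methods replace it with a learned critic (e.g., using compatible features).

% UCB1 index for a stochastic bandit: at time t, play the arm i maximizing
\[
  \bar{X}_i(t) \;+\; \sqrt{\frac{2 \ln t}{N_i(t)}},
\]
% where \bar{X}_i(t) is the empirical mean reward of arm i and N_i(t) is the
% number of times arm i has been played so far.
```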
The portion on MDPs roughly coincides with Chapter 1 of Vol. I of the Dynamic Programming and Optimal Control book of Bertsekas and Chapters 2, 4, 5 and 6 of the Neuro-Dynamic Programming book of Bertsekas and Tsitsiklis. For several topics, the book by Sutton and Barto is a useful reference, in particular for obtaining an intuitive understanding. Chapters 6 and 7 of DP/OC Vol. II are also a useful reference for the advanced topics under RL with function approximation.
Honourable omissions: Neural networks, Average cost models.
Grading
Project: 35%
Homeworks: 30% (Best 3 out of 4)
Quizzes: 30% (15% for each quiz)
Paper summary: 5%
Important Dates
| | On |
|---|---|
| Quiz 1 | Sep 8 |
| Quiz 2 | Oct 7 |
| | Available on | Submission by |
|---|---|---|
| Homework 1 | Aug 20 | Sep 2 |
| Homework 2 | Sep 10 | Sep 20 |
| Homework 3 | Sep 22 | Oct 2 |
| Homework 4 | Oct 7 | Oct 17 |
| Paper summary | | Oct 30 |
| Project proposal | | Oct 30 |
| Project report | | Nov 29 |
Paper review
Each student is expected to read a paper from a list of interesting (albeit non-exhaustive) RL papers, to be made available soon, and submit a 1-page critical review that identifies the strengths and areas for improvement of the paper studied. A mere paraphrase of the paper's content is strongly discouraged.
Course project
The project could be theoretical and/or practical.
Theoretical projects could involve a novel problem formulation, an algorithmic adaptation, or an extension of the existing theory of RL algorithms.
Practical projects could involve implementing existing or new RL algorithms on a new benchmark.
Given the limited timeframe of the course, working out a new theoretical result may not be entirely feasible; in that case, the project can be converted into an empirical investigation of the algorithms the students have become familiar with, while attempting to formulate and solve an RL problem.
The report is expected to provide a detailed description of the problem, the accompanying algorithms, and either the novel theoretical results or the findings from the empirical investigation.
The evaluation is as follows: 5% for the project proposal, 25% for the final report, and 5% for a 15-minute presentation of the work done.
Textbooks/References
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017.
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012.
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998 (an updated draft is available online).
Quizzes and homeworks
Quiz 2 (Note: Q3 requires f to be monotone)
Schedule
Theory of Markov decision processes

| Lecture | Topic | Reference |
|---|---|---|
| Lecture 1 | Finite horizon MDPs: a few examples | 1.1 |
| Lecture 2* | Finite horizon MDP framework, Open vs. closed loop policies | 1.2, 1.3 |
| Lecture 3* | Principle of optimality and the DP algorithm | 1.4 |
| Lecture 4* | Proof of optimality of the DP algorithm | 1.4 |
| Lecture 5 | Two examples: linear system with quadratic cost and asset selling | 1.4 |
| Lecture 6* | Infinite horizon MDPs: an introduction and preview of the main results | 2.1 |
| Lecture 7* | Stochastic shortest path (SSP) formulation | 2.2 |
| Lecture 8 | Proper policies, Bellman optimality operator, fixed point equation for policy evaluation | 2.2 |
| Lecture 9* | Bellman's equation and optimality conditions | 2.2 |
| Lecture 10 | Bellman's equation and optimality conditions (contd.) | 2.2 |
| Lecture 11* | SSP examples | |
| Lecture 12 | HW1 discussion | |
| Lecture 13 | Value iteration | |
| Lecture 14* | Policy iteration, introduction to contraction mappings | |
| Lecture 15* | Contraction mappings in an SSP | |
| Lecture 16* | Discounted cost MDP formulation | |
| Lecture 17 | Bellman equation, optimality conditions | |
| Lecture 18* | Underlying contraction properties, value iteration, policy iteration | |
| Lecture 19* | Linear programming approach, Q-learning | |

Reinforcement learning

| Lecture | Topic |
|---|---|
| Lecture 1 | Introduction to the reinforcement learning problem, connection to stochastic approximation |
| Lecture 2* | First and second-order optimality conditions, gradient descent algorithms |
| Lecture 3* | Probability recap: introduction to sigma fields |
| Lecture 4* | Probability recap: conditional distributions and expectation, martingale differences |
| Lecture 5 | Stochastic iterative algorithms: convergence in the Lyapunov case |
| Lecture 6* | Stochastic iterative algorithms: convergence in the sup-norm pseudo-contraction mapping case |
| Lecture 7* | Monte Carlo policy evaluation |
| Lecture 8 | Temporal difference learning |
| Lecture 9* | TD(0) |
| Lecture 10* | TD(lambda) |
| Lecture 11 | Offline vs. online TD |
| Lecture 12* | Convergence analysis of TD(0) |
| Lecture 13* | TD(lambda): underlying contraction property, introduction to Q-learning |
| Lecture 14 | Q-learning: convergence analysis, need for exploration |
| Lecture 15* | Approximate policy evaluation using projected equations |
| Lecture 16* | Brief introduction to Markov chains, contraction properties for projected equation methods |
| Lecture 17* | Projected equation methods: LSTD and TD(0) variants |
| Lecture 18 | Q-learning with function approximation |
| Lecture 19* | Introduction to policy gradient methods |
| Lecture 20* | REINFORCE algorithm, overview of actor-critic algorithms |
Note: lectures marked * are 75 minutes long, while the rest are 50 minutes.