CS6700: Reinforcement learning
Course information
When: Jul-Nov 2018
Lectures: Slot J
Where: CS26
Teaching Assistants: Ajay Kumar Pandey, Nirav Bhavsar, Nithia V
Office Hours: Wednesday, 3:30pm to 4:30pm
Course Content
Markov Decision Processes
Finite horizon model
General theory
DP algorithm
Infinite horizon model: (1) Stochastic shortest path
General theory: Contraction mapping, Bellman equation
Computational solution schemes: Value and policy iteration, convergence analysis
Infinite horizon model: (2) Discounted cost MDPs
General theory: Contraction mapping, Bellman equation
Classical solution techniques: value and policy iteration (a minimal value-iteration sketch follows this block)
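To make the classical schemes concrete, below is a minimal value-iteration sketch in Python for a discounted-cost MDP. It assumes the model is given explicitly as NumPy arrays; the array layout, function name, and tolerance are illustrative choices of this sketch, not prescribed by the course.

```python
import numpy as np

def value_iteration(P, g, alpha, tol=1e-8, max_iter=10_000):
    """P[a, s, s1] is the probability of moving from s to s1 under action a,
    g[a, s] is the expected stage cost, and alpha in (0, 1) is the discount
    factor. Returns the (approximate) optimal cost J and a greedy policy."""
    J = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # Bellman optimality operator: (TJ)(s) = min_a [ g(a, s) + alpha * E[J(s')] ]
        J_new = (g + alpha * P @ J).min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:  # T is a sup-norm contraction, so this converges
            J = J_new
            break
        J = J_new
    policy = (g + alpha * P @ J).argmin(axis=0)  # greedy policy with respect to J
    return J, policy
```

Policy iteration follows the same template, alternating exact policy evaluation with a greedy improvement step.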
Reinforcement learning
Stochastic iterative algorithms
Convergence result for contraction mappings
Tabular methods
Monte Carlo policy evaluation
Temporal difference learning
TD(0), TD(lambda)
Convergence analysis
Q-learning and its convergence analysis (a tabular sketch follows below)
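A minimal tabular Q-learning sketch, written under the cost-minimization convention of the course texts; the env object with its reset/step interface is a hypothetical simulator stand-in, not any specific library's API.

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.9, eps=0.1, n_episodes=500, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration and diminishing step sizes."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()                       # assumed: returns the initial state
        done = False
        while not done:
            # epsilon-greedy: explore with probability eps, otherwise act greedily on Q
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmin())
            s_next, cost, done = env.step(a)  # assumed: (next state, stage cost, episode over?)
            visits[s, a] += 1
            step = 1.0 / visits[s, a]         # step sizes satisfy the Robbins-Monro conditions
            target = cost + (0.0 if done else alpha * Q[s_next].min())
            Q[s, a] += step * (target - Q[s, a])
            s = s_next
    return Q
```

The per-pair diminishing step sizes and persistent exploration are the kind of conditions on which the convergence analysis rests.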
Function approximation
Approximate policy iteration and error bounds
Approximate policy evaluation using TD(lambda)
The case of linear function approximation
Convergence analysis
Least-squares methods: LSTD and LSPI (an LSTD(0) sketch follows this block)
Q-learning
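A sketch of LSTD(0) for approximate policy evaluation with a linear architecture J(s) ≈ phi(s)^T r; the feature map phi and the trajectory format are assumptions of this sketch.

```python
import numpy as np

def lstd0(states, costs, phi, alpha, reg=1e-6):
    """states: a trajectory generated by the policy being evaluated;
    costs[t]: stage cost of the transition states[t] -> states[t+1];
    phi: maps a state to a d-dimensional feature vector; alpha: discount."""
    d = len(phi(states[0]))
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, s_next, c in zip(states[:-1], states[1:], costs):
        f = np.asarray(phi(s))
        # Sample-average estimate of the projected Bellman equation A r = b
        A += np.outer(f, f - alpha * np.asarray(phi(s_next)))
        b += c * f
    return np.linalg.solve(A + reg * np.eye(d), b)  # small ridge term for numerical stability
```

Unlike TD(0) with a step-size schedule, LSTD solves the projected equation directly from the sampled transitions, at O(d^2) cost per sample.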
Tentative list of other topics:
Policy-gradient and actor-critic algorithms
Policy gradient theorem
Gradient estimation using likelihood ratios
Actor-critic methods: Compatible features, essential features, advantage updating
Regret minimization in MDPs
Introduction to bandits and the UCB algorithm (a UCB1 sketch follows this list)
UCRL and PSRL algorithms
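To give the bandit material a concrete anchor, here is a minimal UCB1 sketch; pull is a hypothetical sampler returning one reward for the chosen arm, and rewards (in [0, 1], to be maximized) follow the usual bandit convention.

```python
import numpy as np

def ucb1(pull, n_arms, horizon):
    """UCB1 for a stochastic multi-armed bandit with rewards in [0, 1]."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1                                    # play each arm once to initialize
        else:
            # optimism in the face of uncertainty: empirical mean plus confidence bonus
            a = int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]           # incremental mean update
    return means, counts
```

UCRL and PSRL lift the same optimism (or posterior-sampling) principle from individual arms to full MDPs.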
The portion on MDPs roughly coincides with Chapter 1 of Vol. I of the Dynamic Programming and Optimal Control book by Bertsekas and Chapters 2, 4, 5 and 6 of the Neuro-Dynamic Programming book by Bertsekas and Tsitsiklis. For several topics, the book by Sutton and Barto is a useful reference, in particular for obtaining an intuitive understanding. Also, Chapters 6 and 7 of DP/OC Vol. II are a useful reference for the advanced topics under RL with function approximation.
Honourable omissions: Neural networks, Average cost models.
Grading
Project: 35%
Homeworks: 30% (Best 3 out of 4)
Quizzes: 30% (15% for each quiz)
Paper summary: 5%
Important Dates
                    On
Quiz 1              Sep 8
Quiz 2              Oct 7

                    Available on    Submission by
Homework 1          Aug 20          Sep 2
Homework 2          Sep 10          Sep 20
Homework 3          Sep 22          Oct 2
Homework 4          Oct 7           Oct 17
Paper summary                       Oct 30
Project proposal                    Oct 30
Project report                      Nov 29
Paper review
Each student is expected to read a paper from a list of interesting (albeit non-exhaustive) RL papers, to be made available soon, and submit a one-page critical review that identifies the strengths and areas of improvement of the paper studied. A paraphrase of the paper's content is strongly discouraged.
Course project
The project could be theoretical and/or practical.
Theoretical projects could involve a novel problem formulation, an algorithmic adaptation, or an extension of the existing theory of RL algorithms.
Practical projects could involve the implementation of existing or new RL algorithms on a new benchmark.
Given the limited timeframe of the course, working out a new theoretical result may not be completely feasible; in that case, the project can be converted into an empirical investigation of the algorithms the students have familiarized themselves with, while attempting to formulate/solve an RL problem.
The report is expected to provide a detailed description of the problem, the accompanying algorithms, and either novel theoretical results or findings from an empirical investigation.
The evaluation is as follows: 5% for the project proposal, 25% for the final report, and 5% for a 15-minute presentation of the work done.
Textbooks/References
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017.
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012.
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. For an updated draft, click here.
Quizzes and homeworks
Quiz 2 (Note: Q3 requires f to be monotone)
Schedule
Theory of Markov decision processes  

Lecture 1  Finite horizon MDPs: a few examples  1.1 
Lecture 2*  Finite horizon MDP framework, open- vs. closed-loop policies  1.2, 1.3
Lecture 3*  Principle of optimality and the DP algorithm  1.4 
Lecture 4*  Proof of optimality of the DP algorithm  1.4 
Lecture 5  Two examples: linear system with quadratic cost and asset selling  1.4 
Lecture 6*  Infinite horizon MDPs: an introduction and preview of the main results  2.1 
Lecture 7*  Stochastic shortest path (SSP) formulation  2.2 
Lecture 8  Proper policies, Bellman optimality operator, Fixed point equation for policy evaluation  2.2 
Lecture 9*  Bellman's equation and optimality conditions  2.2 
Lecture 10  Bellman's equation and optimality conditions (contd)  2.2 
Lecture 11*  SSP examples  
Lecture 12  HW1 discussion  
Lecture 13  Value iteration  
Lecture 14*  Policy iteration, introduction to contraction mappings  
Lecture 15*  Contraction mappings in an SSP  
Lecture 16*  Discounted cost MDP formulation  
Lecture 17  Bellman equation, optimality conditions  
Lecture 18*  Underlying contraction properties, value iteration, policy iteration  
Lecture 19*  Linear programming approach, Q-learning
Reinforcement learning  
Lecture 1  Introduction to the reinforcement learning problem, connection to stochastic approximation
Lecture 2*  First- and second-order optimality conditions, gradient descent algorithms
Lecture 3*  Probability recap: introduction to sigma fields  
Lecture 4*  Probability recap: conditional distributions and expectation, martingale differences  
Lecture 5  Stochastic iterative algorithms: Convergence in the Lyapunov case  
Lecture 6*  Stochastic iterative algorithms: Convergence in the sup-norm pseudo-contraction case
Lecture 7*  Monte Carlo policy evaluation  
Lecture 8  Temporal difference learning  
Lecture 9*  TD(0)  
Lecture 10*  TD(lambda)  
Lecture 11  Offline vs. online TD
Lecture 12*  Convergence analysis of TD(0)  
Lecture 13*  TD(lambda): Underlying contraction property, introduction to Q-learning
Lecture 14  Q-learning: convergence analysis, need for exploration
Lecture 15*  Approximate policy evaluation using projected equations  
Lecture 16*  Brief introduction to Markov chains, Contraction properties for projected equation methods  
Lecture 17*  Projected equation methods: LSTD and TD(0) variants  
Lecture 18  Q-learning with function approximation
Lecture 19*  Introduction to policy gradient methods  
Lecture 20*  REINFORCE algorithm, overview of actor-critic algorithms
The lectures marked * are 75 minutes long; the rest are 50 minutes.