CS6700: Reinforcement learning
Course information
When: Jul-Nov 2018
Lectures: Slot J
Where: CS26
Teaching Assistants: Ajay Kumar Pandey, Nirav Bhavsar, Nithia V
Office Hours: Wednesday, 3:30pm to 4:30pm
Course Content
Markov Decision Processes
Reinforcement learning
The portion on MDPs roughly coincides with Chapter 1 of Vol. I of the Dynamic Programming and Optimal Control book by Bertsekas and Chapters 2, 4, 5 and 6 of the Neuro-Dynamic Programming book by Bertsekas and Tsitsiklis.
For several topics, the book by Sutton and Barto is a useful reference, in particular for obtaining an intuitive understanding. Also, Chapters 6 and 7 of DP/OC Vol. II are a useful reference for the advanced topics under RL with function approximation.
Honourable omissions: Neural networks, Average cost models.
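To give a flavour of the MDP portion, here is a minimal value-iteration sketch for a generic finite MDP, following the cost-minimization convention of the Bertsekas texts. The array shapes and the function name are illustrative assumptions, not course-provided code:

```python
import numpy as np

def value_iteration(P, g, gamma=0.9, tol=1e-8):
    """P[a, s, s']: transition probabilities; g[s, a]: expected one-stage cost."""
    J = np.zeros(g.shape[0])  # initial cost-to-go estimate
    while True:
        # Bellman operator: (TJ)(s) = min_a [ g(s, a) + gamma * E[J(s') | s, a] ]
        Q = g + gamma * np.einsum('ast,t->sa', P, J)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new, Q.argmin(axis=1)  # optimal cost-to-go and a greedy policy
        J = J_new
```

For example, a two-state, two-action MDP would be specified by a (2, 2, 2) array P and a (2, 2) array g.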
Important Dates

Quizzes:
Quiz 1: Sep 8
Quiz 2: Oct 7

Homeworks and other deliverables:
Homework 1: available Aug 20, due Sep 2
Homework 2: available Sep 10, due Sep 20
Homework 3: available Sep 22, due Oct 2
Homework 4: available Oct 7, due Oct 17
Paper summary: due Oct 30
Project proposal: due Oct 30
Project report: due Nov 29
Paper review
Each individual student is expected to read a paper from a list of interesting (albeit non-exhaustive) RL papers, to be made available soon, and submit a 1-page critical review that identifies the strengths and areas of improvement of the paper studied. A paraphrase of the paper's content is strongly discouraged.
Course project
The project could be theoretical and/or practical.
Theoretical projects could involve a novel problem formulation, an algorithmic adaptation, or an extension of the existing theory of RL algorithms.
Practical projects could involve the implementation of existing or new RL algorithms on a new benchmark (see the illustrative baseline sketch at the end of this section).
Given the limited timeframe of the course, working out a new theoretical result may not be completely feasible; in that case, the project could be converted into an empirical investigation of the algorithms the students have familiarized themselves with, while attempting to formulate or solve an RL problem.
The report is expected to provide a detailed description of the problem, the accompanying algorithms, and either novel theoretical results or findings from an empirical investigation.
The evaluation would be as follows:
5% for the project proposal, 25% for the final report and 5% for a 15-minute presentation of the work done.
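By way of illustration only (not a course requirement), a practical project might start from a simple random-agent baseline on a standard benchmark. The sketch below assumes the OpenAI Gym toolkit and its classic 4-tuple step API:

```python
import gym  # assumption: OpenAI Gym is installed (pip install gym)

env = gym.make("CartPole-v1")
returns = []
for episode in range(10):
    obs = env.reset()
    done, total = False, 0.0
    while not done:
        action = env.action_space.sample()          # uniformly random action
        obs, reward, done, info = env.step(action)  # classic Gym step API
        total += reward
    returns.append(total)
print("mean return of the random baseline:", sum(returns) / len(returns))
```

Any learned policy proposed in a project report could then be compared against this baseline's mean return.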
Textbooks/References
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998 (an updated draft is available online)
Schedule

(Bracketed numbers indicate the relevant textbook sections.)

Theory of Markov decision processes

Lecture 1: Finite horizon MDPs: a few examples [1.1]
Lecture 2*: Finite horizon MDP framework, open vs. closed loop policies [1.2, 1.3]
Lecture 3*: Principle of optimality and the DP algorithm [1.4]
Lecture 4*: Proof of optimality of the DP algorithm [1.4]
Lecture 5: Two examples: linear system with quadratic cost and asset selling [1.4]
Lecture 6*: Infinite horizon MDPs: an introduction and preview of the main results [2.1]
Lecture 7*: Stochastic shortest path (SSP) formulation [2.2]
Lecture 8: Proper policies, Bellman optimality operator, fixed point equation for policy evaluation [2.2]
Lecture 9*: Bellman's equation and optimality conditions [2.2]
Lecture 10: Bellman's equation and optimality conditions (contd.) [2.2]
Lecture 11*: SSP examples
Lecture 12: HW1 discussion
Lecture 13: Value iteration
Lecture 14*: Policy iteration, introduction to contraction mappings
Lecture 15*: Contraction mappings in an SSP
Lecture 16*: Discounted cost MDP formulation
Lecture 17: Bellman equation, optimality conditions
Lecture 18*: Underlying contraction properties, value iteration, policy iteration
Lecture 19*: Linear programming approach, Q-learning

Reinforcement learning

Lecture 1: Introduction to the reinforcement learning problem, connection to stochastic approximation
Lecture 2*: First- and second-order optimality conditions, gradient descent algorithms
Lecture 3*: Probability recap: introduction to sigma fields
Lecture 4*: Probability recap: conditional distributions and expectation, martingale differences
Lecture 5: Stochastic iterative algorithms: convergence in the Lyapunov case
Lecture 6*: Stochastic iterative algorithms: convergence in the sup-norm pseudo-contraction mapping case
Lecture 7*: Monte Carlo policy evaluation
Lecture 8: Temporal difference learning
Lecture 9*: TD(0)
Lecture 10*: TD(lambda)
Lecture 11: Offline vs. online TD
Lecture 12*: Convergence analysis of TD(0)
Lecture 13*: TD(lambda): underlying contraction property, introduction to Q-learning
Lecture 14: Q-learning: convergence analysis, need for exploration
Lecture 15*: Approximate policy evaluation using projected equations
Lecture 16*: Brief introduction to Markov chains, contraction properties for projected equation methods
Lecture 17*: Projected equation methods: LSTD and TD(0) variants
Lecture 18: Q-learning with function approximation
Lecture 19*: Introduction to policy gradient methods
Lecture 20*: REINFORCE algorithm, overview of actor-critic algorithms
The lectures marked * are 75 minutes long, while the rest are 50 minutes long.
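To connect a few of the topics above (Q-learning and the need for exploration, RL Lecture 14), here is a minimal tabular Q-learning sketch with epsilon-greedy exploration. The environment interface (reset/step returning next state, reward, done) and all parameter names are hypothetical stand-ins, not code distributed with the course; note it uses the reward-maximization convention of Sutton and Barto:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, else act greedily
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)  # hypothetical environment interface
            # Q-learning update: move Q(s, a) toward the one-step lookahead target
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

The stepsize alpha and exploration parameter eps are held constant here for simplicity; the convergence analyses covered in the lectures typically require appropriately decaying stepsizes.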
