# CS6700: Reinforcement learning

## Course information

**When**: Feb-May 2021

**Lectures**: Slot J

**Where**: Online

**Teaching Assistants**: Nithia Vijayan, Dipayan Sen

## Course Content

#### Markov Decision Processes

Finite horizon model

General theory

DP algorithm

Infinite horizon model: (1) Stochastic shortest path

General theory: Contraction mapping, Bellman equation

Computational solution schemes: Value and policy iteration, convergence analysis

Infinite horizon model: (2) Discounted cost MDPs

General theory: Contraction mapping, Bellman equation

Classical solution techniques: value and policy iteration
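As a rough illustration of value iteration for the discounted cost model listed above, the sketch below repeatedly applies the Bellman operator to a small tabular MDP until successive iterates agree to within a tolerance. The function name `value_iteration`, the data layout (`P[a][s][s2]`, `g[s][a]`), and the two-state example are assumptions made for this sketch and are not taken from the course material.

```python
def value_iteration(P, g, alpha, tol=1e-8, max_iters=10_000):
    """Value iteration for a finite discounted-cost MDP (illustrative sketch).

    P[a][s][s2]: probability of moving from state s to s2 under action a.
    g[s][a]:     expected one-stage cost of action a in state s.
    alpha:       discount factor in (0, 1).
    """
    n_states, n_actions = len(g), len(g[0])
    J = [0.0] * n_states

    def q_values(s, J_cur):
        # One-step lookahead cost of every action in state s.
        return [
            g[s][a] + alpha * sum(P[a][s][s2] * J_cur[s2] for s2 in range(n_states))
            for a in range(n_actions)
        ]

    for _ in range(max_iters):
        # Apply the Bellman operator: minimise over actions in every state.
        J_new = [min(q_values(s, J)) for s in range(n_states)]
        diff = max(abs(x - y) for x, y in zip(J_new, J))
        J = J_new
        if diff < tol:
            break

    # Greedy (cost-minimising) policy with respect to the final iterate.
    policy = []
    for s in range(n_states):
        q = q_values(s, J)
        policy.append(q.index(min(q)))
    return J, policy


# Hypothetical two-state MDP: action 0 stays put, action 1 switches state.
P_toy = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
g_toy = [[1.0, 2.0], [0.0, 0.5]]  # g_toy[s][a]

J_toy, policy_toy = value_iteration(P_toy, g_toy, alpha=0.9)
print(J_toy, policy_toy)
```

In this toy problem, staying in state 1 costs nothing, so the sketch should find that the best move from state 0 is to pay the one-off switching cost of 2 rather than pay 1 per stage forever.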

#### Reinforcement learning

Stochastic iterative algorithm

Convergence result for contraction mappings

Tabular methods

Monte Carlo policy evaluation

Temporal difference learning

TD(0), TD(lambda)

Convergence analysis

Q-learning and its convergence analysis

Function approximation

Approximate policy iteration and error bounds

Approximate policy evaluation using TD(lambda)

The case of linear function approximation

Convergence analysis

Least-squares methods: LSTD and LSPI

Q-learning
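To give a flavour of the tabular Q-learning item in the outline above, here is a minimal sketch in the cost-minimisation convention. Everything named here (`q_learning`, `toy_step`, the two-state environment, uniform random exploration, and the constant step size) is an assumption chosen for illustration. Convergence results for Q-learning are usually stated for diminishing step sizes, and a constant step size is used here only because the toy environment is deterministic.

```python
import random


def q_learning(step, n_states, n_actions, alpha,
               n_steps=5000, step_size=0.5, seed=0):
    """Tabular Q-learning with costs (illustrative sketch).

    step(s, a, rng) must return (cost, next_state); it plays the role of a
    simulator, so no transition probabilities are needed by the learner.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        # Uniform random exploration keeps every state-action pair visited.
        a = rng.randrange(n_actions)
        cost, s_next = step(s, a, rng)
        # Sampled Bellman target: observed cost plus discounted best Q-value.
        target = cost + alpha * min(Q[s_next])
        Q[s][a] += step_size * (target - Q[s][a])
        s = s_next
    return Q


def toy_step(s, a, rng):
    """Hypothetical deterministic simulator: action 0 stays, action 1 switches."""
    costs = [[1.0, 2.0], [0.0, 0.5]]  # costs[s][a]
    s_next = s if a == 0 else 1 - s
    return costs[s][a], s_next


Q_toy = q_learning(toy_step, n_states=2, n_actions=2, alpha=0.9)
greedy = [row.index(min(row)) for row in Q_toy]
print(Q_toy, greedy)
```

For a stochastic simulator, the constant `step_size` would need to be replaced by a schedule that decays with the number of visits to each state-action pair.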

Tentative list of other topics:

Policy-gradient and actor-critic algorithms

Policy gradient theorem

Gradient estimation using likelihood ratios

Actor-critic methods: Compatible features, essential features, advantage updating

Regret minimization in MDPs

Introduction to bandits and UCB algorithm

UCRL and PSRL algorithms
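For the bandit topic listed above, the following is a minimal sketch of the UCB1 index rule, stated with rewards rather than costs, as is common in the bandit literature. The function name `ucb1`, the `pull` callback, and the Bernoulli arms in the usage example are assumptions made for this sketch. The exploration bonus `sqrt(2 ln t / n_i)` is the standard UCB1 choice and may differ from the variant covered in lectures.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """UCB1 for a stochastic multi-armed bandit (illustrative sketch).

    pull(arm) returns a reward in [0, 1] for the chosen arm.
    Returns the pull counts and the empirical mean reward of each arm.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialisation: play every arm once.
            arm = t - 1
        else:
            # Play the arm with the largest upper confidence bound.
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the empirical mean.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


# Usage example with two hypothetical Bernoulli arms.
rng = random.Random(0)
arm_probs = [0.2, 0.8]
counts, means = ucb1(
    lambda i: 1.0 if rng.random() < arm_probs[i] else 0.0,
    n_arms=2,
    horizon=2000,
)
print(counts, means)
```

Over a long enough horizon, the arm with the higher mean should receive the large majority of the pulls, with the other arm sampled only about logarithmically often.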

The portion on MDPs roughly coincides with Chapter 1 of Vol. I of *Dynamic Programming and Optimal Control* by Bertsekas and Chapters 2, 4, 5 and 6 of *Neuro-Dynamic Programming* by Bertsekas and Tsitsiklis.
For several topics, the book by Sutton and Barto is a useful reference, in particular for building an intuitive understanding. Chapters 6 and 7 of *Dynamic Programming and Optimal Control*, Vol. II are also a useful reference for the advanced topics under RL with function approximation.

Honourable omissions: Neural networks, Average cost models.

The schedule of lectures from the 2019 run of this course is available here

## Grading

Mid-term exam: 25%

Final exam: 25%

Assignments: 10% each (two assignments, 20% in total)

Project: 20%

Paper review: 5%

Scribe: 5%

## Important Dates

Assignment 1: Mar 10, Assignment 2: Apr 20

Mid-term: Mar 24

Final: May 13

Project: May 5 (both report and presentation are due on this date)

Paper review: Apr 26

Scribing: May 10

## Paper review

Each student is expected to read a paper from a list to be provided soon and submit a one-page *critical* review that identifies the paper's strengths and areas for improvement. Merely paraphrasing the paper's content is strongly discouraged.

## Course project

Students work in teams of two.

Each team is required to implement an RL algorithm of its choice.

A few MDP environments will be provided for tuning the RL algorithm.

The submitted RL algorithm will be evaluated across different MDP environments and assigned a score based on several performance indicators.

Each team is required to submit a short report describing the chosen algorithm and its associated parameters.

## Assignments facilitated by AIcrowd

Assignment 1: Click here

Assignment 2: Click here

Project: Click here

## Textbooks/References

D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017.

D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012.

D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2020.