# Reinforcement Learning

Reinforcement learning is a paradigm that aims to model the trial-and-error learning process that is needed in many problem situations where explicit instructive signals are not available. It has roots in operations research, behavioral psychology and AI. The goal of the course is to introduce the basic mathematical foundations of reinforcement learning, as well as highlight some of the recent directions of research.

The sections below list the course materials for Weeks 0 to 12. Each topic has both a YouTube link and a VideoKen link.

## Week 0 - Preparatory Material

• Probability tutorial - 1
• Probability tutorial - 2
• Linear algebra tutorial - 1
• Linear algebra tutorial - 2
• Assignment 0
• Solution 0

## Week 1 - Introduction to RL and Immediate RL

• Introduction to RL
• RL framework and applications
• Introduction to immediate RL
• Bandit optimalities
• Value function based methods
• Assignment 1
• Solution 1

## Week 2 - Bandit Algorithms

• UCB 1
• Concentration bounds
• UCB 1 Theorem
• PAC bounds
• Median elimination
• Thompson sampling
• Assignment 2
• Solution 2
• Auer, P.; Cesa-Bianchi, N.; Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem.
• Auer, P.; Ortner, R. 2010. UCB Revisited: Improved Regret Bounds for the Stochastic Multi-Armed Bandit Problem.
• Even-Dar, E.; Mannor, S.; Mansour, Y. 2006. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems.
• Tutorial on OFUL (Szepesvari, C.) Part 1 | Part 2 | Part 3
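As a companion to the UCB 1 lectures and the Auer et al. (2002) reading, here is a minimal sketch of the UCB1 index on a Bernoulli bandit. The function name, the problem instance, and the interface are illustrative assumptions, not course code.

```python
import math
import random

def ucb1(arms, horizon, seed=0):
    """Run UCB1 (Auer et al., 2002) on a Bernoulli bandit (a toy sketch).

    `arms` is an assumed list of success probabilities.
    Returns the empirical mean reward over `horizon` pulls.
    """
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k     # number of pulls per arm
    sums = [0.0] * k     # total reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1    # pull each arm once to initialise
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_a)
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < arms[a] else 0.0
        counts[a] += 1
        sums[a] += r
        total += r
    return total / horizon
```

Because the index inflates the estimate of rarely pulled arms, the suboptimal arm is sampled only O(log T) times, which is the finite-time guarantee proved in the UCB 1 Theorem lecture.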
## Week 3 - Policy Gradient Methods & Introduction to Full RL

• Policy search
• REINFORCE
• Contextual bandits
• Full RL introduction
• Returns, value functions & MDPs
• Assignment 3
• Solution 3
• Notes on REINFORCE algorithm
## Week 4 - MDP Formulation, Bellman Equations & Optimality Proofs

• MDP modelling
• Bellman equation
• Bellman optimality equation
• Cauchy sequence & Green's equation
• Banach fixed point theorem
• Convergence proof
• Assignment 4
• Solution 4
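For quick reference, the Bellman optimality operator at the heart of this week's Banach fixed point argument can be written as follows (standard notation, not taken from the course slides); the contraction property is what the convergence proof lecture builds on.

```latex
(T V)(s) = \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big],
\qquad
\lVert T V - T U \rVert_\infty \le \gamma\, \lVert V - U \rVert_\infty .
```

Since $T$ is a $\gamma$-contraction in the sup norm, the Banach fixed point theorem gives a unique fixed point $V^*$ and convergence of repeated application of $T$ from any starting point.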

## Week 5 - Dynamic Programming & Monte Carlo Methods

• LPI convergence
• Value iteration
• Policy iteration
• Dynamic programming
• Monte Carlo
• Control in Monte Carlo
• Assignment 5
• Solution 5
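The value iteration lecture's algorithm can be sketched as repeated Bellman optimality backups until the update magnitude falls below a tolerance. The dict-based MDP encoding below is an illustrative assumption, not the course's code.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration on a small finite MDP (a minimal sketch).

    `P[s][a]` is a list of (prob, next_state) pairs and `R[s][a]` a scalar
    reward; both encodings are assumptions for this toy example.
    Returns the optimal value function and a greedy policy.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions of r + gamma * E[V]
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # extract a policy greedy with respect to the converged values
    policy = {s: max(P[s], key=lambda a: R[s][a]
                     + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
              for s in P}
    return V, policy
```

The in-place (Gauss-Seidel) sweep still converges because each backup is an application of the contraction operator from Week 4.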

## Week 6 - Monte Carlo & Temporal Difference Methods

• Off Policy MC
• UCT
• TD(0)
• TD(0) control
• Q-learning
• Afterstate
• Assignment 6
• Solution 6

## Week 7 - Eligibility Traces

• Eligibility traces
• Backward view of eligibility traces
• Eligibility trace control
• Thompson sampling recap
• Assignment 7
• Solution 7
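The backward view from this week can be sketched as TD(λ) policy evaluation with accumulating traces; every state's value is nudged by the current TD error in proportion to its decaying trace. The trajectory encoding below is an illustrative assumption.

```python
def td_lambda(episodes, n_states, alpha=0.1, gamma=1.0, lam=0.8):
    """Backward-view TD(lambda) policy evaluation (a minimal sketch).

    `episodes` is an assumed list of trajectories, each a list of
    (state, reward, next_state) transitions with next_state None at
    termination. States are integers in range(n_states).
    """
    V = [0.0] * n_states
    for episode in episodes:
        e = [0.0] * n_states                  # eligibility traces
        for s, r, s2 in episode:
            delta = r + (gamma * V[s2] if s2 is not None else 0.0) - V[s]
            e[s] += 1.0                       # accumulating trace for s
            for i in range(n_states):
                V[i] += alpha * delta * e[i]  # credit all eligible states
                e[i] *= gamma * lam           # decay every trace
    return V
```

With λ = 0 this reduces to TD(0) from Week 6, and with λ = 1 it approaches the Monte Carlo update, which is the spectrum the lectures emphasise.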

## Week 8 - Function Approximation

• Function approximation
• Linear parameterization
• State aggregation methods
• Function approximation & eligibility traces
• LSTD & LSTDQ
• LSPI & Fitted Q
• Assignment 8
• Solution 8
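The LSTD lecture's closed-form solve can be sketched for a two-feature linear value function: accumulate A = Σ φ(s)(φ(s) − γφ(s'))ᵀ and b = Σ φ(s) r over transitions, then solve A w = b. The data encoding and the restriction to two features (so the 2×2 system can be solved by hand) are illustrative assumptions.

```python
def lstd(transitions, gamma=0.9):
    """LSTD for a 2-feature linear value function (a toy sketch).

    `transitions` is an assumed list of (phi_s, r, phi_s2) tuples with
    2-dimensional feature lists (phi_s2 = [0, 0] at termination).
    Returns the weight vector w solving A w = b.
    """
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for phi, r, phi2 in transitions:
        # accumulate A += phi (phi - gamma * phi')^T and b += phi * r
        diff = [phi[j] - gamma * phi2[j] for j in range(2)]
        for i in range(2):
            b[i] += phi[i] * r
            for j in range(2):
                A[i][j] += phi[i] * diff[j]
    # solve the 2x2 system by Cramer's rule
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    w0 = (b[0] * A[1][1] - b[1] * A[0][1]) / det
    w1 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [w0, w1]
```

Unlike the incremental TD updates of earlier weeks, this batch solve uses each transition once and has no step-size parameter, which is the efficiency argument made for LSTD and LSPI.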

## Week 9 - DQN, Fitted Q & Policy Gradient Approaches

• DQN & Fitted Q-iteration
• Policy gradient approach
• Actor critic & REINFORCE
• REINFORCE (cont'd)
• Policy gradient with function approximation
• Assignment 9
• Solution 9
• Notes on Policy Gradient Algorithms
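The REINFORCE lectures' score-function update can be sketched on a toy two-armed bandit with a softmax policy, where the log-policy gradient has the simple form 1[i = a] − π(i). The function name, problem instance, and hyperparameters are illustrative assumptions, and no baseline is used.

```python
import math
import random

def reinforce_bandit(probs, episodes=2000, alpha=0.1, seed=0):
    """REINFORCE (Williams, 1992) on a two-armed Bernoulli bandit (a sketch).

    A softmax policy over two preferences theta; each 'episode' is a
    single pull. `probs` are the assumed arm success probabilities.
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(episodes):
        z = [math.exp(t) for t in theta]
        pi = [zi / sum(z) for zi in z]          # softmax policy
        a = 0 if rng.random() < pi[0] else 1    # sample an action
        r = 1.0 if rng.random() < probs[a] else 0.0
        # grad of log pi(a) w.r.t. theta_i is (1[i == a] - pi[i])
        for i in range(2):
            theta[i] += alpha * r * ((1.0 if i == a else 0.0) - pi[i])
    z = [math.exp(t) for t in theta]
    return [zi / sum(z) for zi in z]            # final action probabilities
```

Subtracting a baseline from `r` leaves the gradient unbiased while reducing variance, which is the refinement developed in the actor-critic lecture.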
## Week 10 - Hierarchical Reinforcement Learning

• Hierarchical reinforcement learning
• Types of optimality
• Semi-Markov decision processes
• Options
• Learning with options
• Hierarchical abstract machines
• Assignment 10
• Solution 10