CS6046: Multi-armed bandits
Course information
When: Jan-May 2018
Lectures: Slot D
Where: CS36
Office Hours: Wednesday, 2:30pm to 3:30pm
Course Content
Exploration-exploitation dilemma in stochastic K-armed bandits
Explore-then-commit and epsilon-greedy strategies
Upper Confidence Bound (UCB) algorithm
Thompson sampling (TS) algorithm
Regret upper bounds for UCB and TS
Lower bounds: a) Information-theoretic bounds on minimax error, b) Instance-dependent regret lower bounds
Pure exploration in K-armed bandits
Uniform sampling
Successive rejects algorithm
Lower bounds
Adversarial bandits
EXP3, EXP3-IX, EXP4 algorithms
Upper bounds on regret of EXP3 and its variants
Regret lower bounds
Advanced topics
Stochastic linear bandits: Ellipsoidal confidence sets, UCB, Thompson sampling and lower bound.
Adversarial linear bandits: Exp2 with John’s exploration, online mirror descent, lower bound.
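To give a concrete flavour of the algorithms listed above, here is a minimal UCB1 sketch for Bernoulli arms. This is an illustration only, not necessarily the exact variant analysed in lectures; the arm means, horizon, and seed below are arbitrary choices for the example.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms whose means are unknown to the learner.

    Returns (total reward collected, number of pulls of each arm).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k     # pulls per arm
    sums = [0.0] * k     # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: pull each arm once
        else:
            # UCB index = empirical mean + exploration bonus
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts
```

On a two-armed instance with means 0.9 and 0.1, the pull counts concentrate on the better arm as the horizon grows, which is exactly the behaviour the regret bounds in the course quantify.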
Grading
Project: 50%
Homeworks: 40% (Best 2 out of 3)
Paper presentation: 10%
Important Dates
                 | Available on | Submission by
Homework 1       | Jan 27       | Feb 12
Homework 2       | Feb 15       | Mar 2
Homework 3       | Mar 9        | Mar 26
Project proposal |              | Apr 2
Project report   |              | Apr 30
Paper presentation
Each student is expected to read a paper from the list here, present the main contributions of the chosen paper, and (hopefully) get the class interested enough to discuss the paper further. Paper presentations will take place during regular class hours in April.
Course project
The project may be theoretical and/or practical: it could involve implementing existing bandit algorithms on a sophisticated benchmark (e.g., recommendation systems), and/or original research in the form of a novel bandit formulation or an extension of existing theoretical results. Given the limited timeframe of the course, working out a new result or a sufficiently novel extension of an existing one may not be entirely feasible; in that case, the project report may instead survey the existing literature that the students became familiar with while attempting to formulate/solve a bandit problem.
For theoretically oriented projects, the list of papers given here can be a good place to start exploring. For practically oriented ones, some pointers are available here.
For theoretically oriented projects, the report is expected to provide a detailed description of the problem and the accompanying algorithms. For projects geared towards performance evaluation of existing bandit algorithms, the report is expected to summarize the findings of an empirical investigation of state-of-the-art bandit algorithms in one of the application domains.
Evaluation is split as follows: 15% for the project proposal, 25% for the final report, and 10% for an oral examination by the instructor.
Textbooks/References
Blog posts on bandit theory by Csaba Szepesvari and Tor Lattimore
Multi-armed bandits book (draft) by Alex Slivkins
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems by S. Bubeck and N. Cesa-Bianchi
Homeworks
Schedule
Regret minimization in MAB (Click here for course notes)

Lecture    | Topics                                                                                      | Notes section
Lecture 1  | Course organization; motivation for studying bandits                                        |
Lecture 2  | Bandit learning framework; regret definition and an equivalent expression using gaps        | 1.1
Lecture 3  | Explore-then-commit (ETC) algorithm; a brief tour of concentration inequalities: subgaussianity | 1.2
Lecture 4  | Finite-sample bounds via the Chernoff method; end of the concentration-inequalities tour    | 1.3
Lecture 5  | Gap-dependent regret bound for ETC                                                          | 1.4
Lecture 6  | Gap-independent regret bound for ETC                                                        | 1.4
Lecture 7  | UCB algorithm: introduction                                                                 | 1.5
Lecture 8  | UCB: gap-dependent regret bound                                                             | 1.5
Lecture 9  | UCB: gap-independent regret bound                                                           | 1.5
Lecture 11 | A brief tour of information theory: entropy, joint and conditional entropy, chain rule      | 1.6
Lecture 12 | KL divergence: definition, non-negativity, chain rule                                       | 1.6
Lecture 13 | Pinsker's inequality; end of the information-theory tour                                    | 1.6
Lecture 14 | Regret lower bound for 2-armed bandits                                                      | 1.7
Lecture 15 | Regret lower bound for K > 2 arms                                                           | 1.7
Lecture 16 | Instance-dependent (asymptotic) lower bound                                                 | 1.7
Lecture 17 | A brief tour of Bayesian statistics: parameter estimation (Bernoulli case)                  | 1.8
Lecture 18 | Parameter estimation: the case of a normal mean                                             | 1.8
Lecture 19 | Bayesian bandit framework and Bayesian regret; Thompson sampling, aka posterior sampling    | 1.9
Lecture 19 | Thompson sampling: Bayesian regret bound                                                    | 1.9
Lecture 20 | Thompson sampling: Bayesian regret bound                                                    | 1.9
Lecture 21 | Brief introduction to KL-UCB, UCB-V, UCB-Improved; deviations of regret                     |
Lecture 22 | Brief introduction to frequentist regret bounds for Thompson sampling                       |
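The Thompson sampling lectures above can be made concrete with a short Beta-Bernoulli sketch: each arm keeps a Beta posterior, the algorithm samples a mean from every posterior, and plays the argmax. The means, horizon, and Beta(1, 1) priors below are illustrative choices, not prescribed by the course.

```python
import random

def thompson_bernoulli(arm_means, horizon, seed=0):
    """Thompson sampling for Bernoulli bandits with independent Beta(1, 1) priors.

    Returns (total reward collected, per-arm pull counts).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    alpha = [1.0] * k   # posterior parameter: successes + 1
    beta = [1.0] * k    # posterior parameter: failures + 1
    counts = [0] * k
    total = 0.0
    for _ in range(horizon):
        # sample a mean from each arm's posterior and play the argmax
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = samples.index(max(samples))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        alpha[arm] += reward          # conjugate Beta posterior update
        beta[arm] += 1.0 - reward
        counts[arm] += 1
        total += reward
    return total, counts
```

As the posteriors of suboptimal arms concentrate below the best arm's posterior, those arms are sampled less and less often.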
Best arm identification in MAB

Lecture    | Topics                                                          | Notes section
Lecture 1  | Best arm identification (BAI): fixed-budget setting; uniform sampling | 2.1
Lecture 2  | Successive rejects (SR) algorithm                               | 2.2
Lecture 3  | Analysis of the error probability of SR                         | 2.2
Lecture 4  | Lower bound for BAI in the fixed-budget setting                 | 2.3
Lecture 5  | Lower bound for BAI in the fixed-budget setting                 | 2.3
Lecture 6  | Lower bound for BAI in the fixed-budget setting                 | 2.3
Lecture 7  | Lower bound for BAI in the fixed-budget setting                 | 2.3
Lecture 8  | Fixed-confidence BAI setting; naive uniformly exploring algorithm | 2.4
Lecture 9  | Successive elimination with known gaps                          | 2.5
Lecture 9  | Successive elimination with unknown gaps                        | 2.5
Lecture 10 | Analysis of the successive elimination algorithm                | 2.5
Lecture 11 | Lower bound for fixed-confidence BAI                            | 2.6
Lecture 12 | Median elimination and a few recent algorithms                  |
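A minimal sketch of the successive rejects algorithm from the fixed-budget lectures: the budget is split into K-1 phases, every surviving arm is sampled equally within a phase, and the empirically worst arm is rejected at the end of each phase. The phase-length schedule follows the standard Audibert-Bubeck allocation; the Bernoulli arms, budget, and seed are illustrative.

```python
import math
import random

def successive_rejects(arm_means, budget, seed=0):
    """Fixed-budget best-arm identification via successive rejects (Bernoulli arms).

    Returns the index of the single arm that survives all K-1 rejection phases.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, k + 1))
    active = list(range(k))
    counts = [0] * k
    sums = [0.0] * k
    n_prev = 0
    for phase in range(1, k):
        # cumulative per-arm sample count after this phase (Audibert-Bubeck schedule)
        n_phase = math.ceil((budget - k) / (log_bar * (k + 1 - phase)))
        for arm in active:
            for _ in range(n_phase - n_prev):
                reward = 1.0 if rng.random() < arm_means[arm] else 0.0
                counts[arm] += 1
                sums[arm] += reward
        n_prev = n_phase
        # reject the active arm with the lowest empirical mean
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)
    return active[0]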
(Stochastic) Linear bandits

Lecture   | Topics
Lecture 1 | Linear bandit formulation; confidence ellipsoid
Lecture 2 | LinUCB algorithm
Lecture 3 | Sketch of regret bound
Adversarial bandits

Lecture   | Topics
Lecture 1 | Prediction with expert advice; halving algorithm and its mistake bound
Lecture 2 | Weighted majority algorithm, mistake bound
Lecture 3 | Exponentially weighted majority (EWM) algorithm
Lecture 4 | Regret bound for the EWM algorithm
Lecture 5 | Adversarial bandit framework; EXP3 algorithm
Lecture 6 | Regret bound for the EXP3 algorithm
Lecture 7 | EXP3.P algorithm, high-probability regret bound
Lecture 8 | EXP3-IX algorithm, high-probability regret bound
Lecture 9 | A survey of other adversarial bandit settings
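A minimal loss-based EXP3 sketch to accompany the adversarial-bandit lectures: exponential weights over arms, with an importance-weighted estimate of the one loss that is actually observed each round. This version omits the explicit uniform-exploration mixing used in some variants (e.g., EXP3.P); the learning rate and the oblivious loss sequence in the usage below are illustrative.

```python
import math
import random

def exp3(loss_matrix, eta, seed=0):
    """EXP3 with importance-weighted loss estimates.

    loss_matrix[t][i] is the adversary's loss for arm i in round t, in [0, 1].
    Returns the algorithm's cumulative loss.
    """
    rng = random.Random(seed)
    k = len(loss_matrix[0])
    weights = [1.0] * k
    total_loss = 0.0
    for losses in loss_matrix:
        z = sum(weights)
        probs = [w / z for w in weights]
        # draw an arm from the exponential-weights distribution
        arm = rng.choices(range(k), weights=probs)[0]
        total_loss += losses[arm]
        # unbiased estimate: only the played arm's loss is observed
        est = losses[arm] / probs[arm]
        weights[arm] *= math.exp(-eta * est)
    return total_loss
```

Against a fixed adversary that always assigns loss 0 to one arm and loss 1 to the other, the cumulative loss stays far below that of uniform play, reflecting the sublinear regret bound proved for EXP3.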