CS6046: Multi-armed bandits
Course information
When: Jan-May 2018
Lectures: Slot D
Where: CS36
Office Hours: Wednesday, 2:30pm to 3:30pm
Course Content
- Exploration-exploitation dilemma in stochastic K-armed bandits
  - Explore-then-commit and epsilon-greedy strategies
  - Upper Confidence Bound (UCB) algorithm (a code sketch follows this list)
  - Thompson sampling (TS) algorithm
  - Regret upper bounds for UCB and TS
  - Lower bounds: (a) information-theoretic bounds on minimax error; (b) instance-dependent regret lower bounds
- Pure exploration in K-armed bandits
  - Uniform sampling
  - Successive rejects algorithm
  - Lower bounds
- Adversarial bandits
  - EXP3, EXP3-IX, EXP4 algorithms
  - Upper bounds on the regret of EXP3 and its variants
  - Regret lower bounds
- Advanced topics
  - Stochastic linear bandits: ellipsoidal confidence sets, UCB, Thompson sampling, and lower bound
  - Adversarial linear bandits: Exp2 with John’s exploration, online mirror descent, lower bound
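For concreteness, here is a minimal sketch of the UCB algorithm listed above, written in Python. The `pull` callback, the UCB1-style exploration bonus, and the Bernoulli arms in the usage lines are illustrative assumptions, not course-mandated choices.

```python
import numpy as np

def ucb1(pull, K, T):
    """UCB1 on a K-armed stochastic bandit with rewards in [0, 1].

    `pull(arm) -> reward` is an assumed environment callback.
    """
    counts = np.zeros(K, dtype=int)  # number of times each arm was pulled
    sums = np.zeros(K)               # cumulative reward of each arm

    for t in range(T):
        if t < K:
            arm = t  # initialization: pull every arm once
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(t) / counts)  # exploration bonus
            arm = int(np.argmax(means + bonus))
        counts[arm] += 1
        sums[arm] += pull(arm)
    return sums.sum()

# Illustrative usage on Bernoulli arms with (assumed) means 0.3, 0.5, 0.7:
rng = np.random.default_rng(0)
mu = [0.3, 0.5, 0.7]
total = ucb1(lambda a: float(rng.random() < mu[a]), K=3, T=10_000)
```

The bonus term shrinks as an arm is pulled more often, which is what trades exploration off against exploitation.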
Grading
Project: 50%
Homeworks: 40% (best 2 out of 3)
Paper presentation: 10%
Important Dates
| | Available on | Submission by |
|---|---|---|
| Homework 1 | Jan 27 | Feb 12 |
| Homework 2 | Feb 15 | Mar 2 |
| Homework 3 | Mar 9 | Mar 26 |
| Project proposal | | Apr 2 |
| Project report | | Apr 30 |
Paper presentation
Each student is expected to read a paper from the list here, present the main contributions of the chosen paper, and (hopefully) get the class interested enough to discuss it further. Paper presentations will take place during regular class hours in April.
Course project
The project may be theoretical and/or practical: it could involve implementing existing bandit algorithms on a sophisticated benchmark (e.g., recommendation systems) and/or original research in the form of a novel bandit formulation or an extension of existing theoretical results. Given the limited timeframe of the course, working out a new result or a sufficiently novel extension of an existing one may not be feasible; in that case, the project report may instead become a survey of the literature that the students became familiar with while attempting to formulate or solve a bandit problem.
For theoretically oriented projects, the list of papers given here is a good place to start exploring. For practically oriented ones, some pointers are available here.
For theoretically oriented projects, the report is expected to provide a detailed description of the problem and the accompanying algorithms. For projects geared towards performance evaluation of existing bandit algorithms, the report is expected to summarize the findings of an empirical investigation of state-of-the-art algorithms in one of the application domains.
The project evaluation is split as follows: 15% for the project proposal, 25% for the final report, and 10% for an oral examination by the instructor.
Textbooks/References
Blog posts on bandit theory by Csaba Szepesvári and Tor Lattimore
Multi-armed bandits book (draft) by Alex Slivkins
Regret analysis of stochastic and non-stochastic multi-armed bandit problems by S. Bubeck and N. Cesa-Bianchi
Homeworks
Schedule
Regret minimization in MAB (click here for course notes)

| | Topic | Notes |
|---|---|---|
| Lecture 1 | Course organization; motivation for studying bandits | |
| Lecture 2 | Bandit learning framework; regret definition, equivalent expression using gaps | 1.1 |
| Lecture 3 | Explore-then-commit (ETC) algorithm; a brief tour of concentration inequalities: subgaussianity | 1.2 |
| Lecture 4 | Finite-sample bounds using the Chernoff method; end of concentration inequalities tour | 1.3 |
| Lecture 5 | Gap-dependent regret bound for ETC | 1.4 |
| Lecture 6 | Gap-independent regret bound for ETC | 1.4 |
| Lecture 7 | UCB algorithm: introduction | 1.5 |
| Lecture 8 | UCB: gap-dependent regret bound | 1.5 |
| Lecture 9 | UCB: gap-independent regret bound | 1.5 |
| Lecture 10 | A brief tour of information theory: entropy, joint and conditional entropy, chain rule | 1.6 |
| Lecture 11 | KL-divergence: definition, non-negativity, chain rule | 1.6 |
| Lecture 12 | Pinsker's inequality; end of information theory tour | 1.6 |
| Lecture 13 | Regret lower bound for 2-armed bandits | 1.7 |
| Lecture 14 | Regret lower bound for K > 2 arms | 1.7 |
| Lecture 15 | Instance-dependent (asymptotic) lower bound | 1.7 |
| Lecture 16 | A brief tour of Bayesian statistics: parameter estimation (Bernoulli case) | 1.8 |
| Lecture 17 | Parameter estimation: the case of a normal mean | 1.8 |
| Lecture 18 | Bayesian bandit framework and Bayesian regret; Thompson sampling, a.k.a. posterior sampling | 1.9 |
| Lecture 19 | Thompson sampling: Bayesian regret bound | 1.9 |
| Lecture 20 | Thompson sampling: Bayesian regret bound (contd.) | 1.9 |
| Lecture 21 | Brief introduction to KL-UCB, UCB-V, UCB-Improved, and deviations of regret | |
| Lecture 22 | Brief introduction to frequentist regret bounds for Thompson sampling | |
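To complement Lectures 18-20 in the table above, the following is a minimal sketch of Thompson sampling for Bernoulli rewards with independent Beta(1, 1) priors; the `pull` callback is an assumed environment interface, not part of the course notes.

```python
import numpy as np

def thompson_bernoulli(pull, K, T, seed=0):
    """Thompson sampling on Bernoulli arms with independent Beta(1, 1) priors.

    `pull(arm) -> reward in {0, 1}` is an assumed environment callback.
    """
    rng = np.random.default_rng(seed)
    alpha = np.ones(K)  # 1 + number of observed successes per arm
    beta = np.ones(K)   # 1 + number of observed failures per arm

    for _ in range(T):
        theta = rng.beta(alpha, beta)  # one posterior sample per arm
        arm = int(np.argmax(theta))    # play the arm whose sample is largest
        r = pull(arm)
        alpha[arm] += r                # conjugate Beta posterior update
        beta[arm] += 1 - r
    return alpha, beta                 # posterior parameters after T rounds
```

Sampling from the posterior, rather than maximizing its mean, is what makes the algorithm explore arms that are plausibly, though not certainly, optimal.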
Best arm identification in MAB

| | Topic | Notes |
|---|---|---|
| Lecture 1 | Best arm identification (BAI): fixed-budget setting; uniform sampling | 2.1 |
| Lecture 2 | Successive rejects (SR) algorithm | 2.2 |
| Lecture 3 | Analysis of the error probability of SR | 2.2 |
| Lecture 4 | Lower bound for BAI in the fixed-budget setting | 2.3 |
| Lecture 5 | Lower bound for BAI in the fixed-budget setting (contd.) | 2.3 |
| Lecture 6 | Lower bound for BAI in the fixed-budget setting (contd.) | 2.3 |
| Lecture 7 | Lower bound for BAI in the fixed-budget setting (contd.) | 2.3 |
| Lecture 8 | Fixed-confidence BAI setting; naive uniformly exploring algorithm | 2.4 |
| Lecture 9 | Successive elimination with known and unknown gaps | 2.5 |
| Lecture 10 | Analysis of the successive elimination algorithm | 2.5 |
| Lecture 11 | Lower bound for fixed-confidence BAI | 2.6 |
| Lecture 12 | Median elimination and a few recent algorithms | |
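As an illustration of the successive rejects algorithm analyzed above, here is a minimal sketch under the standard phase schedule; `pull` is an assumed environment callback, and the code assumes the budget exceeds the number of arms.

```python
import math
import numpy as np

def successive_rejects(pull, K, budget):
    """Successive Rejects for fixed-budget best-arm identification.

    Runs K - 1 phases; after each phase the empirically worst surviving
    arm is rejected. `pull(arm) -> reward` is an assumed callback, and
    budget > K is assumed so every phase pulls each arm at least once.
    """
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    n = lambda k: math.ceil((budget - K) / (log_bar * (K + 1 - k)))

    active = list(range(K))
    counts = np.zeros(K)
    sums = np.zeros(K)
    prev = 0
    for k in range(1, K):
        pulls = n(k) - prev          # extra pulls per surviving arm this phase
        prev = n(k)
        for arm in active:
            for _ in range(pulls):
                sums[arm] += pull(arm)
            counts[arm] += pulls
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)         # reject the empirically worst arm
    return active[0]                 # recommended arm
```

The geometric phase lengths spend most of the budget on the arms that survive longest, which is where the hard comparisons are.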
(Stochastic) Linear bandits

| | Topic | Notes |
|---|---|---|
| Lecture 1 | Linear bandit formulation; confidence ellipsoid | |
| Lecture 2 | LinUCB algorithm | |
| Lecture 3 | Sketch of regret bound | |
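The following is a minimal sketch of LinUCB for a fixed finite arm set, in the spirit of the lectures above. The constant confidence radius `beta`, the arm-feature matrix, and the `pull` callback are simplifying assumptions; in the ellipsoid-based analysis the radius grows slowly with t.

```python
import numpy as np

def lin_ucb(arms, pull, T, lam=1.0, beta=2.0):
    """LinUCB on a stochastic linear bandit with a fixed finite arm set.

    `arms` is a (K, d) array of feature vectors; the reward of arm x is
    <theta*, x> plus noise. `pull(x) -> reward` is an assumed callback,
    and `beta` stands in for the confidence-ellipsoid radius.
    """
    K, d = arms.shape
    V = lam * np.eye(d)       # regularized design matrix
    b = np.zeros(d)           # sum of reward-weighted features
    theta_hat = np.zeros(d)

    for _ in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b  # ridge-regression estimate of theta*
        # confidence width of each arm: ||x||_{V^{-1}} = sqrt(x^T V^{-1} x)
        widths = np.sqrt(np.einsum('kd,de,ke->k', arms, V_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + beta * widths))]
        r = pull(x)
        V += np.outer(x, x)   # rank-one design update
        b += r * x
    return theta_hat
```

Optimism here acts through the ellipsoidal width term: arms whose features are poorly covered by past pulls get a larger bonus.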
Adversarial bandits

| | Topic | Notes |
|---|---|---|
| Lecture 1 | Prediction with expert advice; halving algorithm, mistake bound | |
| Lecture 2 | Weighted majority algorithm, mistake bound | |
| Lecture 3 | Exponentially weighted majority (EWM) algorithm | |
| Lecture 4 | Regret bound for the EWM algorithm | |
| Lecture 5 | Adversarial bandit framework; EXP3 algorithm | |
| Lecture 6 | Regret bound for the EXP3 algorithm | |
| Lecture 7 | EXP3.P algorithm, high-probability regret bound | |
| Lecture 8 | EXP3-IX algorithm, high-probability regret bound | |
| Lecture 9 | A survey of other adversarial bandit settings | |
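Finally, a minimal sketch of EXP3 with importance-weighted loss estimates, matching Lectures 5-6 above; the learning rate uses a standard tuning, and `pull` is an assumed callback returning the adversary's loss in [0, 1].

```python
import numpy as np

def exp3(pull, K, T, eta=None, seed=0):
    """EXP3 for adversarial bandits with losses in [0, 1].

    `pull(arm) -> loss` is an assumed environment callback.
    """
    rng = np.random.default_rng(seed)
    if eta is None:
        eta = np.sqrt(2.0 * np.log(K) / (K * T))  # standard learning rate

    L_hat = np.zeros(K)  # cumulative importance-weighted loss estimates
    for _ in range(T):
        w = np.exp(-eta * (L_hat - L_hat.min()))  # shift for numerical stability
        p = w / w.sum()                           # exponential-weights distribution
        arm = int(rng.choice(K, p=p))
        loss = pull(arm)
        L_hat[arm] += loss / p[arm]               # unbiased loss estimate
    return p
```

Dividing the observed loss by the probability of playing the arm keeps the loss estimates unbiased, which is the key step in the regret analysis covered in Lecture 6.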