BAnDits Against non-Stationarity and Structure (BADASS)

Taking bandits to the next step

Multi-armed bandits are a building block of many applications in machine learning. Despite their success, many real-world applications run into difficulties when the data are non-stationary or structured. The goal of this young researcher project is to advance the understanding of multi-armed bandits both in the usual stationary setting and in the context of non-stationary and structured environments, which requires combining the bandit literature with research fields such as time series, universal prediction, spectral methods, and concentration of measure.

Improved strategies for Bandits

In stochastic multi-armed bandits, strategies such as Thompson sampling or KL-UCB have been shown to be provably optimal for specific classes of distributions, such as one-dimensional exponential families or discrete distributions. Much remains to be done to understand optimality for more general classes and to analyze novel, promising strategies.
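As a rough illustration of the two strategies named above, here is a minimal Python sketch for Bernoulli rewards only. The function names (`kl_bernoulli`, `kl_ucb_index`, `thompson_choice`, `run_kl_ucb`), the exploration budget `log(t)`, and the uniform Beta prior are assumptions chosen for the example; this is not code from the project.

```python
import math
import random


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """Largest q >= mean such that pulls * kl(mean, q) <= log(t), found by bisection."""
    budget = math.log(t) / pulls  # assumed exploration rate: log(t)
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def thompson_choice(successes, failures, rng):
    """Draw one sample per arm from its Beta posterior (uniform prior); return the argmax."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_kl_ucb(means, horizon, seed=0):
    """Simulate KL-UCB on Bernoulli arms with the given means; return pull counts."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize
        else:
            arm = max(
                range(k),
                key=lambda a: kl_ucb_index(sums[a] / pulls[a], pulls[a], t),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls
```

Calling `run_kl_ucb([0.2, 0.8], 2000)` returns how often each arm was pulled; the better arm should dominate. Outside Bernoulli rewards, the divergence and the posterior would have to be replaced by those of the relevant distribution class, which is where the open questions above arise.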

Non-stationary reward signals

One important challenge is to handle non-stationary reward processes. Departing from standard work on individual sequences and the worst-case approach, it seems important to build better adaptive prediction procedures for classes of controlled non-stationarity, and to revisit the aggregation of experts in this context.
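To make "aggregation of experts" under non-stationarity concrete, the sketch below implements the classical fixed-share forecaster, a standard baseline from the individual-sequence literature. It is shown only as an illustration, not as the project's method. The function name `fixed_share`, the squared loss, and the parameter values `eta` and `alpha` are assumptions made for the example.

```python
import math


def fixed_share(expert_preds, outcomes, eta=2.0, alpha=0.05):
    """Aggregate expert predictions in [0, 1] with fixed-share weights.

    expert_preds[t][i] is expert i's prediction at round t; outcomes[t] is the
    observed value. Returns (aggregated predictions, final weights).
    """
    n = len(expert_preds[0])
    weights = [1.0 / n] * n
    predictions = []
    for preds, y in zip(expert_preds, outcomes):
        # Predict with the current weighted average of the experts.
        predictions.append(sum(w * p for w, p in zip(weights, preds)))
        # Exponential-weights update on the squared loss of each expert.
        updated = [w * math.exp(-eta * (p - y) ** 2) for w, p in zip(weights, preds)]
        total = sum(updated)
        updated = [u / total for u in updated]
        # Share step: mix with the uniform distribution so that no expert's
        # weight collapses, which lets the forecaster recover after a change.
        weights = [(1 - alpha) * u + alpha / n for u in updated]
    return predictions, weights
```

With two constant experts (one always predicting 0, the other always 1) and an outcome sequence that switches from 0 to 1 halfway through, the weights move to the second expert within a few rounds after the switch. Plain exponential weights, without the share step, would take much longer to recover.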

Decision making in Structured Environments

Taking into account the structure of the decision-making problem is crucial in a number of applications. Structure can be formalized in many ways, e.g. using graphs, spectral methods, Markov decision processes, or a combination of them. Structure also appears naturally when dealing with non-stationarity.

The BADASS project

The BADASS project ended in 2021. The publications of the project can be found at this place.

The project "Bandits Against Non-stationarity and Structure" (BADASS) is a JCJC project funded by the French ANR (ANR-16-CE40-0002).
It is headed by Odalric-Ambrym Maillard, together with Emilie Kaufmann and Richard Combes.
We intend to focus on the following objectives:

To broaden the range of optimal strategies for stationary MABs: current strategies are only known to be provably optimal in a limited range of scenarios in which the class of distributions (the structure) is perfectly known; in addition, recent heuristics that may adapt to the class need to be analyzed further.
To strengthen the literature on pure sequential prediction (focusing on a single arm) for non-stationary signals through the construction of adaptive confidence sets and a novel measure of complexity: traditional approaches consider a worst-case scenario and are thus overly conservative and do not adapt to simpler signals.
To embed low-rank matrix completion and spectral methods in the context of reinforcement learning, and to further study models of structured environments: promising heuristics, e.g. for contextual MABs or Predictive State Representations, require stronger theoretical guarantees.

Visiting researcher: Junpei Komiyama

Prof. Tze Leung Lai at the EWRL

Postdoctoral member

Hiring of a new Ph.D. student.

Kick-off meeting