**(WP3) Monte-Carlo Tree Search by Best Arm Identification, ** in *NIPS 2017 – 31st Annual Conference on Neural Information Processing Systems*, Dec 2017, Long Beach, United States. HaL

*«Recent advances in bandit tools and techniques for sequential learning are steadily enabling new applications and are promising the resolution of a range of challenging related problems. We study the game tree search problem, where the goal is to quickly identify the optimal move in a given game tree by sequentially sampling its stochastic payoffs. We develop new algorithms for trees of arbitrary depth, that operate by summarizing all deeper levels of the tree into confidence intervals at depth one, and applying a best arm identification procedure at the root. We prove new sample complexity guarantees with a refined dependence on the problem instance. We show experimentally that our algorithms outperform existing elimination-based algorithms and match previous special-purpose methods for depth-two trees.»*

**(WP2) Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings**, in *CROWNCOM 2017 – 12th EAI International Conference on Cognitive Radio Oriented Wireless Networks*, Sep 2017, Lisbon, Portugal. HaL

*«La mise en place des futurs réseaux Internet des Objets (IoT) nécessitera de supporter de plus en plus d’appareils communicants. Nous prouvons que les objets adaptatifs, dans des bandes non licenciées, peuvent utiliser les algorithmes d’apprentissage de type Bandit Multi-Bras (MAB) pour améliorer l’exploitation des ressources. Nous évaluons les performances de deux algorithmes classiques d’apprentissage MAB, UCB1 et Thompson Sampling, pour prendre en charge la prise de décision décentralisée d’Analyse de Spectre, appliquée aux réseaux IoT, ainsi que les performances d’apprentissage avec un nombre croissant d’objets intelligents. Nous montrons que l’utilisation d’algorithmes d’apprentissage aide à adapter un plus grand nombre de dispositifs dans de tels réseaux, même lorsque tous les appareils finaux sont intelligents et changent de canal de façon dynamique. Dans le scénario étudié, l’apprentissage stochastique (MAB) fournit un gain allant jusqu’à 16% en terme de probabilités de transmission réussie, et a des performances quasi optimales même dans les situations non stationnaires et non i.i.d. avec une majorité d’appareils intelligents.»*

**(WP2) Efficient tracking of a growing number of experts**, J. Mourtada, O.-A. Maillard in Algorithmic Learning Theory (ALT), 2017.

*«We consider a variation on the problem of prediction with expert advice, where new forecasters that were unknown until then may appear at each round. As often in prediction with expert advice, designing an algorithm that achieves near-optimal regret guarantees is straightforward, using aggregation of experts. However, when the comparison class is sufficiently rich, for instance when the best expert and the set of experts itself change over time, such strategies require to maintain a prohibitive number of weights (typically exponential with the time horizon). By contrast, designing strategies that both achieve a near-optimal regret and maintain a reasonable number of weights is highly non-trivial. We consider three increasingly challenging objectives (simple regret, shifting regret and sparse shifting regret) that extend existing notions defined for a fixed expert ensemble; in each case, we design strategies that achieve tight regret bounds, adaptive to the parameters of the comparison class, while being computationally inexpensive. Moreover, our algorithms are anytime, agnostic to the number of incoming experts and completely parameter-free. Such remarkable results are made possible thanks to two simple but highly effective recipes: first the “abstention trick” that comes from the specialist framework and enables to handle the least challenging notions of regret, but is limited when addressing more sophisticated objectives. Second, the “muting trick” that we introduce to give more flexibility. We show how to combine these two tricks in order to handle the most challenging class of comparison strategies.»*

**(WP1) Boundary Crossing Probabilities for General Exponential Families**, O.-A. Maillard in Algorithmic Learning Theory (ALT), 2017. ArXiv.

*«We consider parametric exponential families of dimension K on the real line. We study a variant of \textit{boundary crossing probabilities} coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension K. Formally, our result is a concentration inequality that bounds the probability that B^ψ(θ^n,θ⋆)≥f(t/n)/n, where θ⋆ is the parameter of an unknown target distribution, θ^n is the empirical parameter estimate built from n observations, ψ is the log-partition function of the exponential family and Bψ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function f is logarithmic, as it is enables to analyze the regret of the state-of-the-art KL-UCB and KL-UCB+ strategies, whose analysis was left open in such generality. Indeed, previous results only hold for the case when K=1, while we provide results for arbitrary finite dimension K, thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques to achieve these strong results already existed three decades ago in the work of T.L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques that we believe are useful beyond the application to stochastic multi-armed bandits.»*

**(WP3) Spectral Learning from a Single Trajectory under Finite-State Policies**, B. Balle, O.-A. Maillard in International conference on Machine Learning, 2017.

*«**We present spectral methods of moments for **learning sequential models from a single trajectory, **in stark contrast with the classical literature **that assumes the availability of multiple i.i.d. **trajectories. Our approach leverages an efficient* *SVD-based learning algorithm for weighted automata **and provides the first rigorous analysis **for learning many important models using dependent* *data. We state and analyze the algorithm under **three increasingly difficult scenarios: probabilistic **automata, stochastic weighted automata, **and reactive predictive state representations controlled* *by a finite-state policy. Our proofs include **novel tools for studying mixing properties **of stochastic weighted automata.** »*