# Four papers on multi-armed bandit and reinforcement learning accepted at AI&Stats’17

We had four papers accepted at AI&Stats’17 on multi-armed bandit and reinforcement learning. The most interesting result is the first regret bound on learning in MDPs with options. For the first time we have an explicit measure of the impact of options on the exploration-exploitation performance of a reinforcement learning algorithm. This paves the way to very interesting developments, where we could finally construct options that are tailored to the environments at hand and obtain a hierarchy of skills that provably improve the learning performance.

Exploration-Exploitation in MDPs with Options (Ronan Fruit and A. Lazaric)

While a large body of empirical results show that temporally-extended actions and options may significantly affect the learning performance of an agent, the theoretical understanding of how and when options can be beneficial in online reinforcement learning is relatively limited. In this paper, we derive an upper and lower bound on the regret of a variant of UCRL using options. While we first analyze the algorithm in the general case of semi-Markov decision processes (SMDPs), we show how these results can be translated to the specific case of MDPs with options and we illustrate simple scenarios in which the regret of learning with options can be provably much smaller than the regret suffered when learning with primitive actions.

Trading off Rewards and Errors in Multi-Armed Bandits (A. Erraqabi, A. Lazaric, M. Valko, E. Brunskill, Y.-E. Liu)

In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small \textit{estimation errors}) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large \textit{reward}). In this paper, we formalize this tradeoff and introduce the \forcing algorithm whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that \forcing returns \textit{useful} information about the arms without compromising the overall reward.

Linear Thompson Sampling Revisited (Marc Abeille and A. Lazaric)

We derive an alternative proof for the regret of Thompson sampling (\ts) in the stochastic linear bandit setting. While we obtain a regret bound of order $\wt{O}(d^{3/2}\sqrt{T})$ as in previous results, the proof sheds new light on the functioning of the \ts. We leverage on the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function and how selecting optimal arms associated to \textit{optimistic} parameters does control it. Thus we show that \ts can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof can be readily applied to regularized linear optimization and generalized linear model problems.

Thompson Sampling for Linear-Quadratic Control Problems (Marc Abeille and A. Lazaric)

We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics is linear and the cost function is quadratic in states and controls. We analyze the regret of Thompson sampling (\ts) (a.k.a. posterior-sampling for reinforcement learning) in the frequentist setting, i.e., when the parameters characterizing the LQ dynamics are fixed. Despite the empirical and theoretical success in a wide range of problems from multi-armed bandit to linear bandit, we show that when studying the frequentist regret \ts in control problems, we need to trade-off the frequency of sampling optimistic parameters and the frequency of switches in the control policy. This results in an overall regret of $O(T^{2/3})$, which is significantly worse than the regret $O(\sqrt{T})$ achieved by the optimism-in-face-of-uncertainty algorithm in LQ control problems.