HAL publications of ANR project ANR-16-CE40-0002

2020

Conference papers

Title
Sub-sampling for Efficient Non-Parametric Bandit Exploration
Authors
Dorian Baudry, Emilie Kaufmann, Odalric-Ambrym Maillard
Reference
NeurIPS 2020, Dec 2020, Vancouver, Canada
Abstract
In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling, which requires specifying a different prior to be optimal in each case, our proposal RB-SDA does not need any distribution-dependent tuning. RB-SDA belongs to the family of Sub-sampling Duelling Algorithms (SDA), which combines the sub-sampling idea first used by the BESA [1] and SSMC [2] algorithms with different sub-sampling schemes. In particular, RB-SDA uses Random Block sampling. We perform an experimental study assessing the flexibility and robustness of this promising novel approach for exploration in bandit models.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02977552/file/sda_hal.pdf BibTex
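To make the duelling mechanism above concrete, here is a minimal Python sketch of one round of an SDA algorithm with Random Block sampling, under the assumptions that rewards are stored per arm and that every arm has been pulled at least once; it is an illustrative reading of the abstract, not the authors' reference implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def rb_sda_round(history):
        # history[k] is the list of rewards observed so far from arm k;
        # every arm is assumed to have been pulled at least once.
        counts = [len(h) for h in history]
        leader = int(np.argmax(counts))  # the arm with the most samples
        winners = []
        for k in range(len(history)):
            if k == leader:
                continue
            n_k = counts[k]
            # Random Block sampling: a contiguous block of the leader's
            # history, of the same size as the challenger's sample count.
            start = rng.integers(0, counts[leader] - n_k + 1)
            block = history[leader][start:start + n_k]
            # The challenger wins its duel if its empirical mean is at
            # least that of the leader's sub-sample.
            if np.mean(history[k]) >= np.mean(block):
                winners.append(k)
        # Pull every challenger that won its duel, or the leader otherwise.
        return winners if winners else [leader]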
Title
A Practical Algorithm for Multiplayer Bandits when Arm Means Vary Among Players
Authors
Etienne Boursier, Emilie Kaufmann, Abbas Mehrabian, Vianney Perchet
Reference
AISTATS 2020 - 23rd International Conference on Artificial Intelligence and Statistics, Aug 2020, Palermo, Italy
Abstract
We study a multiplayer stochastic multi-armed bandit problem in which players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward. We consider the challenging heterogeneous setting, in which different arms may have different means for different players, and propose a new and efficient algorithm that combines the idea of leveraging forced collisions for implicit communication and that of performing matching eliminations. We present a finite-time analysis of our algorithm, giving the first sublinear minimax regret bound for this problem, and prove that if the optimal assignment of players to arms is unique, our algorithm attains the optimal O(ln(T)) regret, solving an open question raised at NeurIPS 2018 by Bistritz and Leshem (2018).
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02006069/file/aistats20.pdf BibTex
Title
Tightening Exploration in Upper Confidence Reinforcement Learning
Authors
Odalric-Ambrym Maillard, Hippolyte Bourel, Mohammad Talebi
Reference
International Conference on Machine Learning, Jul 2020, Vienna, Austria
Abstract
The upper confidence reinforcement learning (UCRL2) algorithm introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this algorithm and its variants have remained until now mostly theoretical, as numerical experiments in simple environments exhibit long burn-in phases before the learning takes place. In pursuit of practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair. Furthermore, to tighten exploration, it uses an adaptive computation of the support of each transition distribution, which in turn enables us to revisit the extended value iteration procedure of UCRL2 to optimize over distributions with reduced support by disregarding low-probability transitions, while still ensuring near-optimism. We demonstrate, through numerical experiments in standard environments, that reducing exploration this way yields a substantial numerical improvement compared to UCRL2 and its variants. On the theoretical side, these key modifications enable us to derive a regret bound for UCRL3 improving on UCRL2, which for the first time brings out notions of local diameter and local effective support, thanks to variance-aware concentration bounds.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-03000664/file/ICML2020_UCRL3_FinalVersion.pdf BibTex
Title
Restarted Bayesian Online Change-point Detector achieves Optimal Detection Delay
Authors
Réda Alami, Odalric-Ambrym Maillard, Raphael Féraud
Reference
International Conference on Machine Learning, Jul 2020, Vienna, Austria
Abstract
In this paper, we consider the problem of sequential change-point detection where both the change points and the distributions before and after the change are assumed to be unknown. For this problem of primary importance in statistical and sequential learning theory, we derive a variant of the Bayesian Online Change Point Detector proposed by (Fearnhead & Liu, 2007) which is easier to analyze than the original version while keeping its powerful message-passing algorithm. We provide a non-asymptotic analysis of the false-alarm rate and the detection delay that matches the existing lower bound. We further provide the first explicit high-probability control of the detection delay for such an approach. Experiments on synthetic and real-world data show that this proposal outperforms the state-of-the-art change-point detection strategy, namely the Improved Generalized Likelihood Ratio (Improved GLR), while comparing favorably with the original Bayesian Online Change Point Detection strategy.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-03021712/file/paper.pdf BibTex
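For intuition, the following sketch implements the classical run-length message passing behind a Bayesian Online Change Point Detector, here for Bernoulli observations with a Beta prior and a constant hazard rate; the paper's restarted variant additionally resets the detector once a change is declared, and its tuning differs from this toy version.

    import numpy as np

    def bocpd_bernoulli(xs, hazard=0.01, a0=1.0, b0=1.0):
        # Run-length posterior via message passing, for Bernoulli data
        # with a Beta(a0, b0) prior and a constant change-point hazard.
        r = np.array([1.0])              # P(run length = 0) before any data
        a, b = np.array([a0]), np.array([b0])
        map_run_lengths = []
        for x in xs:
            # Predictive probability of x under each run-length hypothesis.
            pred = a / (a + b) if x == 1 else b / (a + b)
            growth = r * pred * (1 - hazard)    # no change at this step
            cp = (r * pred).sum() * hazard      # change: run length resets
            r = np.concatenate(([cp], growth))
            r /= r.sum()
            # Update the sufficient statistics of every hypothesis.
            a = np.concatenate(([a0], a + x))
            b = np.concatenate(([b0], b + (1 - x)))
            map_run_lengths.append(int(np.argmax(r)))
        return map_run_lengths

A sharp drop of the most probable run length is then read as a detected change.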
Title
Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling
Authors
Cindy Trinh, Emilie Kaufmann, Claire Vernade, Richard Combes
Reference
ALT 2020 - 31st International Conference on Algorithmic Learning Theory, Feb 2020, San Diego, United States. pp.1 - 28
Abstract
Stochastic Rank-One Bandits (Katariya et al. (2017a,b)) are a simple framework for regret minimization problems over rank-one matrices of arms. The initially proposed algorithms are proved to have logarithmic regret, but do not match the existing lower bound for this problem. We close this gap by first proving that rank-one bandits are a particular instance of unimodal bandits, and then providing a new analysis of Unimodal Thompson Sampling (UTS), initially proposed by Paladino et al. (2017). We prove an asymptotically optimal regret bound on the frequentist regret of UTS and we support our claims with simulations showing the significant improvement of our method compared to the state of the art.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02396943/file/Trinh20.pdf BibTex
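The following sketch shows the shape of one UTS decision for Bernoulli arms with Beta(1, 1) priors; the neighborhood structure and the exact leader-exploitation schedule are simplifications of the paper's algorithm.

    import numpy as np

    rng = np.random.default_rng(0)

    def uts_step(successes, failures, neighbors, leader_count, ell=2):
        # neighbors[k] lists the arms adjacent to k in the unimodal
        # structure; leader_count[k] counts how often k has led so far.
        means = successes / np.maximum(successes + failures, 1)
        leader = int(np.argmax(means))
        leader_count[leader] += 1
        # Periodically exploit the empirical leader; otherwise run
        # Thompson Sampling restricted to the leader and its neighbors.
        if leader_count[leader] % ell == 0:
            return leader
        candidates = [leader] + list(neighbors[leader])
        samples = [rng.beta(successes[k] + 1, failures[k] + 1)
                   for k in candidates]
        return candidates[int(np.argmax(samples))]

For rank-one bandits, the neighbors of an entry (i, j) are the other entries of row i and column j, which is what makes the reduction to unimodal bandits work.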
Title
Fixed-confidence guarantees for Bayesian best-arm identification
Authors
Xuedong Shang, Rianne de Heide, Emilie Kaufmann, Pierre Ménard, Michal Valko
Reference
International Conference on Artificial Intelligence and Statistics, 2020, Palermo, Italy
Abstract
We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS). In particular, we justify its use for fixed-confidence best-arm identification. We further propose a variant of TTTS called Top-Two Transportation Cost (T3C), which disposes of the computational burden of TTTS. As our main contribution, we provide the first sample complexity analysis of TTTS and T3C when coupled with a very natural Bayesian stopping rule, for bandits with Gaussian rewards, solving one of the open questions raised by Russo (2016). We also provide new posterior convergence results for TTTS under two models that are commonly used in practice: bandits with Gaussian and Bernoulli rewards and conjugate priors.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02330187/file/main.pdf BibTex
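The TTTS sampling rule itself is compact; here is a hedged sketch for Bernoulli arms with Beta(1, 1) priors, based on the abstract rather than the paper's code.

    import numpy as np

    rng = np.random.default_rng(0)

    def ttts_step(successes, failures, beta=0.5, max_resamples=1000):
        # successes/failures are integer arrays with one entry per arm.
        theta = rng.beta(successes + 1, failures + 1)
        best = int(np.argmax(theta))
        if rng.random() < beta:
            return best                       # play the posterior best arm
        for _ in range(max_resamples):
            challenger = int(np.argmax(rng.beta(successes + 1, failures + 1)))
            if challenger != best:
                return challenger             # play the challenger instead
        return best  # fallback when the posterior is very concentrated

T3C keeps the first posterior sample but replaces the resampling loop with a direct computation of the challenger minimizing a transportation cost, which is what removes the computational burden mentioned in the abstract.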

2019

Journal articles

Title
Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits
Authors
Alexander Luedtke, Emilie Kaufmann, Antoine Chambaz
Reference
Machine Learning, Springer Verlag, 2019, 108 (11), pp.1919-1949. ⟨10.1007/s10994-019-05799-x⟩
Abstract
We study a generalization of the multi-armed bandit problem with multiple plays where there is a cost associated with pulling each arm and the agent has a budget at each time that dictates how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show these algorithms are asymptotically optimal, both in rate and leading problem-dependent constants, including in the thick margin setting where multiple arms fall on the decision boundary.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01338733/file/LKC19_HaL.pdf BibTex

Conference papers

Title
Budgeted Reinforcement Learning in Continuous State Space
Authors
Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin
Reference
Conference on Neural Information Processing Systems, Dec 2019, Vancouver, Canada
Abstract
A Budgeted Markov Decision Process (BMDP) is an extension of a Markov Decision Process to critical applications requiring safety constraints. It relies on a notion of risk implemented in the shape of a cost signal constrained to lie below an adjustable threshold. So far, BMDPs could only be solved in the case of finite state spaces with known dynamics. This work extends the state of the art to continuous-space environments and unknown dynamics. We show that the solution to a BMDP is a fixed point of a novel Budgeted Bellman Optimality operator. This observation allows us to introduce natural extensions of Deep Reinforcement Learning algorithms to address large-scale BMDPs. We validate our approach on two simulated applications: spoken dialogue and autonomous driving.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02375727/file/9128-budgeted-reinforcement-learning-in-continuous-state-space.pdf BibTex
Title
Regret Bounds for Learning State Representations in Reinforcement Learning
Authors
Ronald Ortner, Matteo Pirotta, Ronan Fruit, Alessandro Lazaric, Odalric-Ambrym Maillard
Reference
Conference on Neural Information Processing Systems, Dec 2019, Vancouver, Canada
Abstract
We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent. At least one of these representations is assumed to induce a Markov decision process (MDP), and the performance of the agent is measured in terms of cumulative regret against the optimal policy giving the highest average reward in this MDP representation. We propose an algorithm (UCB-MS) with O(√T) regret in any communicating MDP. The regret bound shows that UCB-MS automatically adapts to the Markov model and improves over the currently known best bound of order O(T^{2/3}).
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02375715/file/9435-regret-bounds-for-learning-state-representations-in-reinforcement-learning.pdf BibTex
Title
Learning Multiple Markov Chains via Adaptive Allocation
Authors
Mohammad Talebi, Odalric-Ambrym Maillard
Reference
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Dec 2019, Vancouver, Canada
Abstract
We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain. We assume that the Markov chains are ergodic but otherwise unknown. The learner can sample Markov chains sequentially to observe their states. The goal of the learner is to sequentially select various chains to learn transition matrices uniformly well with respect to some loss function. We introduce a notion of loss that naturally extends the squared loss for learning distributions to the case of Markov chains, and further characterize the notion of being uniformly good in all problem instances. We present a novel learning algorithm that efficiently balances exploration and exploitation intrinsic to this problem, without any prior knowledge of the chains. We provide finite-sample PAC-type guarantees on the performance of the algorithm. Further, we show that our algorithm asymptotically attains an optimal loss.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02387345/file/AL_MC_HAL.pdf BibTex
Title
Model-Based Reinforcement Learning Exploiting State-Action Equivalence
Authors
Mahsa Asadi, Mohammad Talebi, Hippolyte Bourel, Odalric-Ambrym Maillard
Reference
ACML 2019, Proceedings of Machine Learning Research, Nov 2019, Nagoya, Japan. pp.204 - 219
Abstract
Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, as a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of regret by a factor of √(SA/C), in any communicating MDP with S states, A actions, and C classes, which corresponds to a massive improvement when C ≪ SA. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02378887/file/2019%20-%20Model-based%20Reinforcement%20Learning%20Exploiting%20State-Action%20Equivalence.pdf BibTex
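The gain from a known equivalence structure can be pictured with a small sketch: pooling the transition counts of all state-action pairs in a class yields one (tighter) estimate per class instead of one per pair. The profile maps the paper uses to align next-state distributions are ignored in this toy version.

    import numpy as np

    def pooled_transitions(counts, classes, n_classes):
        # counts[k] is the next-state count vector of state-action pair k,
        # classes[k] the index of its equivalence class.
        pooled = np.zeros((n_classes, counts.shape[1]))
        for k, c in enumerate(classes):
            pooled[c] += counts[k]     # pairs in a class share their samples
        totals = np.maximum(pooled.sum(axis=1, keepdims=True), 1)
        # Each pair inherits the estimate of its class: roughly C confidence
        # sets instead of SA of them, hence the tighter regret bound.
        return (pooled / totals)[classes]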
Title
Practical Open-Loop Optimistic Planning
Authors
Edouard Leurent, Odalric-Ambrym Maillard
Reference
European Conference on Machine Learning, Sep 2019, Würzburg, Germany
Abstract
We consider the problem of online planning in a Markov Decision Process when given only access to a generative model, restricted to open-loop policies (i.e., sequences of actions) and under a budget constraint. In this setting, the Open-Loop Optimistic Planning (OLOP) algorithm enjoys good theoretical guarantees but is overly conservative in practice, as we show in numerical experiments. We propose a modified version of the algorithm with tighter upper-confidence bounds, KL-OLOP, that leads to better practical performance while retaining the sample complexity bound. Finally, we propose an efficient implementation that significantly improves the time complexity of both algorithms.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02375697/file/407.pdf BibTex
Title
General parallel optimization without a metric
Authors
Xuedong Shang, Emilie Kaufmann, Michal Valko
Reference
Algorithmic Learning Theory, 2019, Chicago, United States
Abstract
Hierarchical bandits are an approach for global optimization of extremely irregular functions. This paper provides new elements regarding POO, an adaptive meta-algorithm that does not require the knowledge of local smoothness of the target function. We first highlight the fact that the subroutine algorithm used in POO should have a small regret under the assumption of local smoothness with respect to the chosen partitioning, and it is not known whether the standard subroutine HOO satisfies this assumption. In this work, we establish such a regret guarantee for HCT, another hierarchical optimistic optimization algorithm that needs to know the smoothness. This confirms the validity of POO. We show that POO can be used with HCT as a subroutine, with a regret upper bound that matches that of the best-known algorithms using the knowledge of smoothness, up to a √(log n) factor. On top of that, we propose a general wrapper, called GPO, that can cope with algorithms that only have simple regret guarantees. Finally, we complement our findings with experiments on difficult functions.
Full-text access and BibTeX
https://hal.inria.fr/hal-02047225/file/shang2019general.pdf BibTex
Title
Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds
Authors
Odalric-Ambrym Maillard
Reference
Algorithmic Learning Theory, 2019, Chicago, United States. pp.1 - 23
Abstract
We consider change-point detection in a fully sequential setup, when observations are received one by one and one must raise an alarm as early as possible after any change. We assume that both the change points and the distributions before and after the change are unknown. We consider the class of piecewise-constant mean processes with sub-Gaussian noise, and we target a detection strategy that is uniformly good on this class (this constrains the false alarm rate and detection delay). We introduce a novel tuning of the GLR test that here takes a simple form involving scan statistics, based on a novel sharp concentration inequality using an extension of the Laplace method for scan statistics that holds doubly uniformly in time. This also considerably simplifies the implementation of the test and its analysis. We provide (perhaps surprisingly) the first fully non-asymptotic analysis of the detection delay of this test that matches the known existing asymptotic orders, with fully explicit numerical constants. Then, we extend this analysis to allow some changes that are not detectable by any uniformly good strategy (the numbers of observations before and after the change are too small for it to be detected by any such algorithm), and provide the first robust, finite-time analysis of the detection delay.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-02351665/file/maillard19.pdf BibTex
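A toy version of such a scan test, for sub-Gaussian observations, compares the empirical means on each side of every split point to a confidence width; the threshold below is indicative and not the paper's exact Laplace-based tuning.

    import numpy as np

    def glr_scan_alarm(xs, sigma=1.0, delta=0.05):
        # Scan over all split points of the observations seen so far and
        # raise an alarm if the two sides differ by more than the width.
        t = len(xs)
        for s in range(1, t):
            n1, n2 = s, t - s
            gap = abs(np.mean(xs[:s]) - np.mean(xs[s:]))
            # Crude union-bound width for the gap of two sub-Gaussian means.
            width = sigma * np.sqrt(
                2 * (1 / n1 + 1 / n2) * np.log(2 * t ** 1.5 / delta))
            if gap > width:
                return s          # alarm: a change is located around s
        return None               # no change detected yet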

2018

Journal articles

Title
Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs
Authors
Mohammad Talebi, Odalric-Ambrym Maillard
Reference
Journal of Machine Learning Research, Microtome Publishing, In press, pp.1-36
Abstract
The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by bringing out the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-Ucrl algorithm establishing a high-probability regret bound scaling as O(√(S Σ_{s,a} V_{s,a} T)) for this algorithm for ergodic MDPs, where S denotes the number of states and where V_{s,a} is the variance of the bias function with respect to the next-state distribution following action a in state s. The resulting bound improves upon the best previously known regret bound O(DS√(AT)) for that algorithm, where A and D respectively denote the maximum number of actions (per state) and the diameter of the MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs, indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, that we believe to be of independent interest.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01737142/file/ALT18_CameraReady.pdf BibTex
Title
Streaming kernel regression with provably adaptive mean, variance, and regularization
Authors
Audrey Durand, Odalric-Ambrym Maillard, Joelle Pineau
Reference
Journal of Machine Learning Research, Microtome Publishing, 2018, 1, pp.1 - 48
Abstract
We consider the problem of streaming kernel regression, when the observations arrive sequentially and the goal is to recover the underlying mean function, assumed to belong to an RKHS. The variance of the noise is not assumed to be known. In this context, we tackle the problem of tuning the regularization parameter adaptively at each time step, while maintaining tight confidence bound estimates on the value of the mean function at each point. To this end, we first generalize existing results for finite-dimensional linear regression with fixed regularization and known variance to the kernel setup with a regularization parameter allowed to be a measurable function of past observations. Then, using appropriate self-normalized inequalities, we build upper and lower bound estimates for the variance, leading to Bernstein-like concentration bounds. The latter is used to define the adaptive regularization. The bounds resulting from our technique are valid uniformly over all observation points and all time steps, and are compared against the literature with numerical experiments. Finally, the potential of these tools is illustrated by an application to kernelized bandits, where we revisit the Kernel UCB and Kernel Thompson Sampling procedures, and show the benefits of the novel adaptive kernel tuning strategy.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01927007/file/1708.00768.pdf BibTex
Title
Boundary Crossing Probabilities for General Exponential Families
Authors
Odalric-Ambrym Maillard
Reference
Mathematical Methods of Statistics, Allerton Press, Springer, 2018, 27
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01737150/file/BoundaryCrossingForExponentialFamilies_RevisedVersion.pdf BibTex

Conference papers

Title
Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling
Authors
Emilie Kaufmann, Wouter Koolen, Aurélien Garivier
Reference
Advances in Neural Information Processing Systems (NIPS), Dec 2018, Montréal, Canada
Abstract
Learning the minimum/maximum mean among a finite set of distributions is a fundamental sub-task in planning, game tree search and reinforcement learning. We formalize this learning task as the problem of sequentially testing how the minimum mean among a finite set of distributions compares to a given threshold. We develop refined non-asymptotic lower bounds, which show that optimality mandates very different sampling behavior for a low vs high true minimum. We show that Thompson Sampling and the intuitive Lower Confidence Bounds policy each nail only one of these cases. We develop a novel approach that we call Murphy Sampling. Even though it entertains exclusively low true minima, we prove that MS is optimal for both possibilities. We then design advanced self-normalized deviation inequalities, fueling more aggressive stopping rules. We complement our theoretical guarantees by experiments showing that MS works best in practice.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01804581/file/arXiv.pdf BibTex
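Murphy Sampling admits a compact sketch: sample the posterior conditionally on the minimum mean lying below the threshold (here by naive rejection sampling on Gaussian arms) and pull the arm achieving the sampled minimum; the stopping rules based on self-normalized deviations are omitted.

    import numpy as np

    rng = np.random.default_rng(0)

    def murphy_sampling_step(means, counts, sigma2, threshold,
                             max_tries=10_000):
        # Gaussian posterior on each mean under a flat prior; condition on
        # the event {min mean < threshold} by rejection sampling.
        for _ in range(max_tries):
            theta = rng.normal(means, np.sqrt(sigma2 / np.maximum(counts, 1)))
            if theta.min() < threshold:
                return int(np.argmin(theta))  # pull the conditional minimizer
        # The conditioning event has negligible posterior mass: fall back
        # on the empirically lowest arm.
        return int(np.argmin(means))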
Title
Adaptive black-box optimization got easier: HCT only needs local smoothness
Authors
Xuedong Shang, Emilie Kaufmann, Michal Valko
Reference
European Workshop on Reinforcement Learning, Oct 2018, Lille, France
Abstract
Hierarchical bandits are an approach for global optimization of extremely irregular functions. This paper provides new elements regarding POO, an adaptive meta-algorithm that does not require the knowledge of local smoothness of the target function. We first highlight the fact that the subroutine algorithm used in POO should have a small regret under the assumption of local smoothness with respect to the chosen partitioning, and it is not known whether the standard subroutine HOO satisfies this assumption. In this work, we establish such a regret guarantee for HCT, another hierarchical optimistic optimization algorithm that needs to know the smoothness. This confirms the validity of POO. We show that POO can be used with HCT as a subroutine, with a regret upper bound that matches that of the best-known algorithms using the knowledge of smoothness, up to a √(log n) factor.
Full-text access and BibTeX
https://hal.inria.fr/hal-01874637/file/shang2018adaptive.pdf BibTex
Title
Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access
Authors
Lilian Besson, Emilie Kaufmann, Christophe Moy
Reference
IEEE WCNC - IEEE Wireless Communications and Networking Conference, Apr 2018, Barcelona, Spain. ⟨10.1109/wcnc.2018.8377070⟩
Abstract
Multi-armed bandit algorithms have recently been studied and evaluated for Cognitive Radio (CR), especially in the context of Opportunistic Spectrum Access (OSA). Several solutions have been explored based on various models, but it is hard to predict exactly which one will be the best for real-world conditions at every instant. Hence, expert aggregation algorithms can be useful to select, on the fly, the best algorithm for a specific situation. Aggregation algorithms, such as Exp4 dating back to 2002, have never been used for OSA learning, and we show that Exp4 empirically appears inefficient when applied to simple stochastic problems. In this article, we present an improved variant, called Aggregator. For synthetic OSA problems modeled as Multi-Armed Bandit (MAB) problems, simulation results are presented to demonstrate its empirical efficiency. We combine classical algorithms, such as Thompson sampling, Upper-Confidence Bounds algorithms (UCB and variants), and Bayesian or Kullback-Leibler UCB. Our algorithm offers good performance compared to state-of-the-art algorithms (Exp4, CORRAL or LearnExp), appears to be a robust approach to select, on the fly, the best algorithm for any stochastic MAB problem, and is better suited to real-world radio settings than any tuning-based approach.
Full-text access and BibTeX
https://hal.inria.fr/hal-01705292/file/IEEE_WCNC__2018__Paper__Lilian_Besson__07-17.pdf BibTex
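The aggregation idea can be sketched with an exponential-weights meta-bandit in the spirit of Exp4: trust one base algorithm per round, reweight it with an importance-weighted reward, and broadcast the observed feedback to all base algorithms. This is a hedged reading of the abstract, not the exact Aggregator update of the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def run_aggregation(experts, pull, horizon, eta=0.1):
        # experts expose choose() and update(arm, reward); pull(arm)
        # returns a reward in [0, 1].
        w = np.ones(len(experts))
        for _ in range(horizon):
            p = w / w.sum()
            i = rng.choice(len(experts), p=p)   # trust one expert this round
            arm = experts[i].choose()
            reward = pull(arm)
            # Importance weighting keeps the gain estimate unbiased.
            w[i] *= np.exp(eta * reward / p[i])
            for expert in experts:              # share feedback with everyone
                expert.update(arm, reward)
        return w / w.sum()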
Title
Multi-Player Bandits Revisited
Authors
Lilian Besson, Emilie Kaufmann
Reference
Algorithmic Learning Theory, Mehryar Mohri; Karthik Sridharan, Apr 2018, Lanzarote, Spain
Abstract
Multi-player Multi-Armed Bandits (MAB) have been extensively studied in the literature, motivated by applications to Cognitive Radio systems. Driven by such applications as well, we motivate the introduction of several levels of feedback for multi-player MAB algorithms. Most existing work assumes that sensing information is available to the algorithm. Under this assumption, we improve the state-of-the-art lower bound for the regret of any decentralized algorithm and introduce two algorithms, RandTopM and MCTopM, that are shown to empirically outperform existing algorithms. Moreover, we provide strong theoretical guarantees for these algorithms, including a notion of asymptotic optimality in terms of the number of selections of bad arms. We then introduce a promising heuristic, called Selfish, that can operate without sensing information, which is crucial for emerging applications to Internet of Things networks. We investigate the empirical performance of this algorithm and provide some first theoretical elements for the understanding of its behavior.
Full-text access and BibTeX
https://hal.inria.fr/hal-01629733/file/BK__ALT_2018.pdf BibTex
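The Selfish heuristic mentioned above is easy to sketch: each player runs a plain UCB index on the reward it actually observes, so a collision simply looks like a zero reward and no sensing is needed (an illustrative form; the paper studies its behavior in detail).

    import numpy as np

    def selfish_step(total_observed_reward, counts, t):
        # UCB1-style index on the observed reward, where collisions have
        # already been counted as zeros in total_observed_reward.
        n = np.maximum(counts, 1)
        index = total_observed_reward / n + np.sqrt(np.log(max(t, 2)) / (2 * n))
        return int(np.argmax(index))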
Title
Pure Exploration in Infinitely-Armed Bandit Models with Fixed-Confidence
Authors
Maryam Aziz, Jesse Anderton, Emilie Kaufmann, Javed Aslam
Reference
ALT 2018 - Algorithmic Learning Theory, Apr 2018, Lanzarote, Spain
Abstract
We consider the problem of near-optimal arm identification in the fixed-confidence setting of the infinitely armed bandit problem, when nothing is known about the arm reservoir distribution. We (1) introduce a PAC-like framework within which to derive and cast results; (2) derive a sample complexity lower bound for near-optimal arm identification; (3) propose an algorithm that identifies a nearly-optimal arm with high probability and derive an upper bound on its sample complexity which is within a log factor of our lower bound; and (4) discuss whether our log^2(1/delta) dependence is inescapable for "two-phase" (select arms first, identify the best later) algorithms in the infinite setting. This work permits the application of bandit models to a broader class of problems where fewer assumptions hold.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01729969/file/aziz18.pdf BibTex
Title
Corrupt Bandits for Preserving Local Privacy
Authors
Pratik Gajane, Tanguy Urvoy, Emilie Kaufmann
Reference
ALT 2018 - Algorithmic Learning Theory, Apr 2018, Lanzarote, Spain
Abstract
We study a variant of the stochastic multi-armed bandit (MAB) problem in which the rewards are corrupted. In this framework, motivated by privacy preservation in online recommender systems, the goal is to maximize the sum of the (unobserved) rewards, based on observing a transformation of these rewards through a stochastic corruption process with known parameters. We provide a lower bound on the expected regret of any bandit algorithm in this corrupted setting. We devise a frequentist algorithm, KLUCB-CF, and a Bayesian algorithm, TS-CF, and give upper bounds on their regret. We also provide the appropriate corruption parameters to guarantee a desired level of local privacy and analyze how this impacts the regret. Finally, we present some experimental results that confirm our analysis.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01757297/file/Gajane18.pdf BibTex
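For a concrete picture of the corruption process, here is a randomized-response sketch for binary rewards, together with the de-biasing step that a corrupted-feedback algorithm such as KLUCB-CF or TS-CF must build on; the parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt(reward, p=0.9, q=0.2):
        # The learner observes 1 with probability p if the true reward is 1,
        # and with probability q if it is 0; (p, q) set the privacy level.
        return int(rng.random() < (p if reward == 1 else q))

    def debiased_mean(feedback, p=0.9, q=0.2):
        # Invert E[feedback] = q + (p - q) * mean to recover the true mean
        # from corrupted observations only.
        return (np.mean(feedback) - q) / (p - q)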

2017

Journal articles

Title
The Non-stationary Stochastic Multi-armed Bandit Problem
Authors
Robin Allesiardo, Raphaël Féraud, Odalric-Ambrym Maillard
Reference
International Journal of Data Science and Analytics, Springer Verlag, 2017, 3 (4), pp.267-283. ⟨10.1007/s41060-017-0050-5⟩
Abstract
We consider a variant of the stochastic multi-armed bandit with K arms where the rewards are not assumed to be identically distributed, but are generated by a non-stationary stochastic process. We first study the unique best arm setting, when there exists one unique best arm. Second, we study the general switching best arm setting, when a best arm switches at some unknown steps. For both settings, we target problem-dependent bounds, instead of the more conservative problem-free bounds. We consider two classical problems: (1) identify a best arm with high probability (best arm identification), for which the performance is measured by the sample complexity (the number of samples needed to find a near-optimal arm). To this end, we naturally extend the definition of sample complexity so that it makes sense in the switching best arm setting, which may be of independent interest. (2) Achieve the smallest cumulative regret (regret minimization), where the regret is measured with respect to the strategy pulling an arm with the best instantaneous mean at each step.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01575000/file/NonStaStoMAB_final.pdf BibTex

Conference papers

Title
Monte-Carlo Tree Search by Best Arm Identification
Authors
Emilie Kaufmann, Wouter Koolen
Reference
NIPS 2017 - 31st Annual Conference on Neural Information Processing Systems, Dec 2017, Long Beach, United States. pp.1-23
Abstract
Recent advances in bandit tools and techniques for sequential learning are steadily enabling new applications and are promising the resolution of a range of challenging related problems. We study the game tree search problem, where the goal is to quickly identify the optimal move in a given game tree by sequentially sampling its stochastic payoffs. We develop new algorithms for trees of arbitrary depth, that operate by summarizing all deeper levels of the tree into confidence intervals at depth one, and applying a best arm identification procedure at the root. We prove new sample complexity guarantees with a refined dependence on the problem instance. We show experimentally that our algorithms outperform existing elimination-based algorithms and match previous special-purpose methods for depth-two trees.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01535907/file/KK17arXiv.pdf BibTex
Title
Efficient tracking of a growing number of experts
Authors
Jaouad Mourtada, Odalric-Ambrym Maillard
Reference
Algorithmic Learning Theory, Oct 2017, Kyoto, Japan. pp.1 - 23
Abstract
We consider a variation on the problem of prediction with expert advice, where new forecasters that were unknown until then may appear at each round. As often in prediction with expert advice, designing an algorithm that achieves near-optimal regret guarantees is straightforward, using aggregation of experts. However, when the comparison class is sufficiently rich, for instance when the best expert and the set of experts itself change over time, such strategies naively require maintaining a prohibitive number of weights (typically exponential with the time horizon). By contrast, designing strategies that both achieve a near-optimal regret and maintain a reasonable number of weights is highly non-trivial. We consider three increasingly challenging objectives (simple regret, shifting regret and sparse shifting regret) that extend existing notions defined for a fixed expert ensemble; in each case, we design strategies that achieve tight regret bounds, adaptive to the parameters of the comparison class, while being computationally inexpensive. Moreover, our algorithms are anytime, agnostic to the number of incoming experts and completely parameter-free. Such remarkable results are made possible thanks to two simple but highly effective recipes: first, the "abstention trick" that comes from the specialist framework and enables us to handle the least challenging notions of regret, but is limited when addressing more sophisticated objectives. Second, the "muting trick" that we introduce to give more flexibility. We show how to combine these two tricks in order to handle the most challenging class of comparison strategies.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01615424/file/paper-growing.pdf BibTex
Title
Boundary Crossing for General Exponential Families
Authors
Odalric-Ambrym Maillard
Reference
Algorithmic Learning Theory, Oct 2017, Kyoto, Japan. pp.1 - 34
Abstract
We consider parametric exponential families of dimension K on the real line. We study a variant of boundary crossing probabilities coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension K. Formally, our result is a concentration inequality that bounds the probability that B_ψ(θ_n, θ) ≥ f(t/n)/n, where θ is the parameter of an unknown target distribution, θ_n is the empirical parameter estimate built from n observations, ψ is the log-partition function of the exponential family and B_ψ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function f is logarithmic, as it enables the analysis of the regret of the state-of-the-art KL-ucb and KL-ucb+ strategies, whose analysis was left open in such generality. Indeed, previous results only hold for the case when K = 1, while we provide results for arbitrary finite dimension K, thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques to achieve these strong results already existed three decades ago in the work of T.L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques that we believe are useful beyond the application to stochastic multi-armed bandits.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01615427/file/15.pdf BibTex
Title
Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings
Authors
Rémi Bonnefoi, Lilian Besson, Christophe Moy, Emilie Kaufmann, Jacques Palicot
Reference
CROWNCOM 2017 - 12th EAI International Conference on Cognitive Radio Oriented Wireless Networks, Sep 2017, Lisbon, Portugal. pp.173-185, ⟨10.1007/978-3-319-76207-4_15⟩
Abstract
Setting up the future Internet of Things (IoT) networks will require supporting more and more communicating devices. We prove that intelligent devices in unlicensed bands can use Multi-Armed Bandit (MAB) learning algorithms to improve resource exploitation. We evaluate the performance of two classical MAB learning algorithms, UCB1 and Thompson Sampling, to handle the decentralized decision-making of Spectrum Access, applied to IoT networks, as well as learning performance with a growing number of intelligent end-devices. We show that using learning algorithms does help to fit more devices in such networks, even when all end-devices are intelligent and are dynamically changing channel. In the studied scenario, stochastic MAB learning provides up to a 16% gain in terms of successful transmission probability, and has near-optimal performance even in non-stationary and non-i.i.d. settings with a majority of intelligent devices.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01575419/file/BBMKP_CROWNCOM_2017.pdf BibTex
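The two learning rules evaluated in the paper have textbook forms; a minimal sketch of the per-device channel choice, assuming Bernoulli transmission successes, is:

    import numpy as np

    rng = np.random.default_rng(0)

    def ucb1_channel(mean_rewards, counts, t):
        # UCB1: empirical mean plus an exploration bonus.
        n = np.maximum(counts, 1)
        return int(np.argmax(mean_rewards + np.sqrt(2 * np.log(max(t, 2)) / n)))

    def thompson_channel(successes, failures):
        # Thompson Sampling with Beta(1, 1) priors: sample one plausible
        # success probability per channel and transmit on the best one.
        return int(np.argmax(rng.beta(successes + 1, failures + 1)))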
Title
Spectral Learning from a Single Trajectory under Finite-State Policies
Authors
Borja Balle, Odalric-Ambrym Maillard
Reference
International Conference on Machine Learning, Jul 2017, Sydney, Australia
Abstract
We present spectral methods of moments for learning sequential models from a single trajectory, in stark contrast with the classical literature that assumes the availability of multiple i.i.d. trajectories. Our approach leverages an efficient SVD-based learning algorithm for weighted automata and provides the first rigorous analysis for learning many important models using dependent data. We state and analyze the algorithm under three increasingly difficult scenarios: probabilistic automata, stochastic weighted automata, and reactive predictive state representations controlled by a finite-state policy. Our proofs include novel tools for studying mixing properties of stochastic weighted automata.
Full-text access and BibTeX
https://hal.archives-ouvertes.fr/hal-01590940/file/cr_ICML_HankelMatrices2.pdf BibTex