Publications HAL de Odalric-Ambrym Maillard

2025

Conference papers

titre: A Continual Offline Reinforcement Learning Benchmark for Navigation Tasks
auteur: Anthony Kobanda, Odalric-Ambrym Maillard, Rémy Portelas
article: COG 2025 - IEEE Conference on Games, Aug 2025, Lisboa, Portugal
resume: Autonomous agents operating in domains such as robotics or video game simulations must adapt to changing tasks without forgetting about the previous ones. This process called Continual Reinforcement Learning poses non-trivial difficulties, from preventing catastrophic forgetting to ensuring the scalability of the approaches considered. Building on recent advances, we introduce a benchmark providing a suite of video-game navigation scenarios, thus filling a gap in the literature and capturing key challenges : catastrophic forgetting, task adaptation, and memory efficiency. We define a set of various tasks and datasets, evaluation protocols, and metrics to assess the performance of algorithms, including state-of-the-art baselines. Our benchmark is designed not only to foster reproducible research and to accelerate progress in continual reinforcement learning for gaming, but also to provide a reproducible framework for production pipelines -- helping practitioners to identify and to apply effective approaches.
Accès au texte intégral et bibtex

titre: Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning
auteur: Anthony Kobanda, Rémy Portelas, Odalric-Ambrym Maillard, Ludovic Denoyer
article: ICLR-MCDC 2025 - Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning, Apr 2025, Singapore, Singapore
resume: We consider a Continual Reinforcement Learning setup, where a learning agent must continuously adapt to new tasks while retaining previously acquired skill sets, with a focus on the challenge of avoiding forgetting past gathered knowledge and ensuring scalability with the growing number of tasks. Such issues prevail in autonomous robotics and video game simulations, notably for navigation tasks prone to topological or kinematic changes. To address these issues, we introduce HiSPO, a novel hierarchical framework designed specifically for continual learning in navigation settings from offline data. Our method leverages distinct policy subspaces of neural networks to enable flexible and efficient adaptation to new tasks while preserving existing knowledge. We demonstrate, through a careful experimental study, the effectiveness of our method in both classical MuJoCo maze environments and complex video game-like navigation simulations, showcasing competitive performances and satisfying adaptability with respect to classical continual learning metrics, in particular regarding the memory usage and efficiency.
Accès au texte intégral et bibtex

Preprints, Working Papers, ...

titre: Kriging and Gaussian Process Interpolation for Georeferenced Data Augmentation
auteur: Frédérick Fabre Ferber, Dominique Gay, Jean-Christophe Soulié, Jean Diatta, Odalric-Ambrym Maillard
article: 2025
resume: Data augmentation is a crucial step in the development of robust supervised learning models, especially when dealing with limited datasets. This study explores interpolation techniques for the augmentation of geo-referenced data, with the aim of predicting the presence of Commelina benghalensis L. in sugarcane plots in La Réunion. Given the spatial nature of the data and the high cost of data collection, we evaluated two interpolation approaches: Gaussian processes (GPs) with different kernels and kriging with various variograms. The objectives of this work are threefold: (i) to identify which interpolation methods offer the best predictive performance for various regression algorithms, (ii) to analyze the evolution of performance as a function of the number of observations added, and (iii) to assess the spatial consistency of augmented datasets. The results show that GP-based methods, in particular with combined kernels (GP-COMB), significantly improve the performance of regression algorithms while requiring less additional data. Although kriging shows slightly lower performance, it is distinguished by a more homogeneous spatial coverage, a potential advantage in certain contexts.
Accès au texte intégral et bibtex

2024

Journal articles

titre: Bandits with Stochastic Corruption: Lower Bounds on Regret and Robust Optimistic Algorithms
auteur: Timothée Mathieu, Debabrota Basu, Odalric-Ambrym Maillard
article: Transactions on Machine Learning Research Journal, 2024
resume: We study the Bandits with Stochastic Corruption problem, i.e. a stochastic multi-armed bandit problem with $k$ unknown reward distributions, which are heavy-tailed and corrupted by a history-independent stochastic adversary or Nature. To be specific, the reward obtained by playing an arm comes from corresponding heavy-tailed reward distribution with probability $1-\varepsilon \in (0.5,1]$ and an arbitrary corruption distribution of unbounded support with probability $\varepsilon \in [0,0.5)$. First, we provide \textit{a problem-dependent lower bound on the regret} of any corrupted bandit algorithm. The lower bounds indicate that the Bandits with Stochastic Corruption problem is harder than the classical stochastic bandit problem with sub-Gaussian or heavy-tail rewards. Following that, we propose a novel UCB-type algorithm for Bandits with Stochastic Corruption, namely \texttt{HubUCB}, that builds on Huber's estimator for robust mean estimation. Leveraging a novel concentration inequality of Huber's estimator, we prove that \texttt{HubUCB} achieves a near-optimal regret upper bound. Since computing Huber's estimator has quadratic complexity, we further introduce a sequential version of Huber's estimator that exhibits linear complexity. We leverage this sequential estimator to design \texttt{SeqHubUCB} that enjoys similar regret guarantees while reducing the computational burden. Finally, we experimentally illustrate the efficiency of \texttt{HubUCB} and \texttt{SeqHubUCB} in solving Bandits with Stochastic Corruption for different reward distributions and different levels of corruptions.
Accès au texte intégral et bibtex

titre: AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents
auteur: Timothée Mathieu, Riccardo Della Vecchia, Alena Shilova, Matheus Medeiros Centa, Hector Kohler, Odalric-Ambrym Maillard, Philippe Preux
article: Transactions on Machine Learning Research Journal, 2024
resume: The reproducibility of many experimental results in Deep Reinforcement Learning (RL) is under question. To solve this reproducibility crisis, we propose a theoretically sound methodology to compare multiple Deep RL algorithms. The performance of one execution of a Deep RL algorithm is random so that independent executions are needed to assess it precisely. When comparing several RL algorithms, a major question is how many executions must be made and how can we assure that the results of such a comparison is theoretically sound. Researchers in Deep RL often use less than 5 independent executions to compare algorithms: we claim that this is not enough in general. Moreover, when comparing several algorithms at once, the error of each comparison accumulates and must be taken into account with a multiple tests procedure to preserve low error guarantees. To address this problem in a statistically sound way, we introduce AdaStop, a new statistical test based on multiple group sequential tests. When comparing algorithms, AdaStop adapts the number of executions to stop as early as possible while ensuring that we have enough information to distinguish algorithms that perform better than the others in a statistical significant way. We prove both theoretically and empirically that AdaStop has a low probability of making an error (Family-Wise Error). Finally, we illustrate the effectiveness of AdaStop in multiple use-cases, including toy examples and difficult cases such as Mujoco environments.
Accès au texte intégral et bibtex

Conference papers

titre: Bandits with Multimodal Structure
auteur: Hassan Saber, Odalric-Ambrym Maillard
article: RLC 2024 - Reinforcement Learning Conference, Aug 2024, Amherst Massachusetts, United States. pp.39
resume: We consider a multi-armed bandit problem specified by a set of one-dimensional exponential family distributions endowed with a multimodal structure. The multimodal structure naturally extends the unimodal structure and appears to be underlying in quite interesting ways popular structures such as linear or Lipschitz bandits. We introduce IMED-MB, an algorithm that optimally exploits the multimodal structure, by adapting to this setting the popular Indexed Minimum Empirical Divergence (IMED) algorithm. We provide instance-dependent regret analysis of this strategy. Numerical experiments show that IMED-MB performs well in practice when assuming unimodal, polynomial or Lipschitz mean function.
Accès au texte intégral et bibtex

titre: Power Mean Estimation in Stochastic Monte-Carlo Tree Search
auteur: Tuan Quang Tuan Dam, Odalric-Ambrym Maillard, Emilie Kaufmann
article: Uncertainty in Artificial Intelligence, Jul 2024, Barcelona, Spain
resume: Monte-Carlo Tree Search (MCTS) is a widelyused strategy for online planning that combines Monte-Carlo sampling with forward tree search. Its success relies on the Upper Confidence bound for Trees (UCT) algorithm, an extension of the UCB method for multi-arm bandits. However, the theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection, leading to the development of Fixed-Depth-MCTS with a polynomial exploration bonus to balance exploration and exploitation [Shah et al., 2022]. Both UCT and Fixed-Depth-MCTS suffer from biased value estimation: the weighted sum underestimates the optimal value, while the maximum valuation overestimates it [Coulom, 2006]. The power mean estimator offers a balanced solution, lying between the average and maximum values. Power-UCT [Dam et al., 2019] incorporates this estimator for more accurate value estimates but its theoretical analysis remains incomplete. This paper introduces Stochastic-Power-UCT, an MCTS algorithm using the power mean estimator and tailored for stochastic MDPs. We analyze its polynomial convergence in estimating root node values and show that it shares the same convergence rate of O(n -1/2 ), with n is the number of visited trajectories, as Fixed-Depth-MCTS, with the latter being a special case of the former. Our theoretical results are validated with empirical tests across various stochastic MDP environments.
Accès au texte intégral et bibtex

titre: CRIMED: Lower and Upper Bounds on Regret for Bandits with Unbounded Stochastic Corruption
auteur: Shubhada Agrawal, Timothée Mathieu, Debabrota Basu, Odalric-Ambrym Maillard
article: International Conference on Algorithmic Learning Theory (ALT), Feb 2024, San Diego (CA), United States. pp.74-124
resume: We investigate the regret-minimisation problem in a multi-armed bandit setting with arbitrary corruptions. Similar to the classical setup, the agent receives rewards generated independently from the distribution of the arm chosen at each time. However, these rewards are not directly observed. Instead, with a fixed $\varepsilon\in (0,\frac{1}{2})$, the agent observes a sample from the chosen arm's distribution with probability $1-\varepsilon$, or from an arbitrary corruption distribution with probability $\varepsilon$. Importantly, we impose no assumptions on these corruption distributions, which can be unbounded. In this setting, accommodating potentially unbounded corruptions, we establish a problem-dependent lower bound on regret for a given family of arm distributions. We introduce CRIMED, an asymptotically-optimal algorithm that achieves the exact lower bound on regret for bandits with Gaussian distributions with known variance. Additionally, we provide a finite-sample analysis of CRIMED's regret performance. Notably, CRIMED can effectively handle corruptions with $\varepsilon$ values as high as $\frac{1}{2}$. Furthermore, we develop a tight concentration result for medians in the presence of arbitrary corruptions, even with $\varepsilon$ values up to $\frac{1}{2}$, which may be of independent interest. We also discuss an extension of the algorithm for handling misspecification in Gaussian model.
Accès au bibtex

Poster communications

titre: Évaluation de critères de sélection de noyaux pour la régression Ridge à noyau dans un contexte de petits jeux de données
auteur: Frédérick Fabre Ferber, Dominique Gay, Jean-Christophe Soulié, Jean Diatta, Odalric-Ambrym Maillard
article: 24ème conférence francophone sur l'Extraction et la Gestion des Connaissances EGC 2024, Jan 2024, Dijon, France. RNTI E-40
Accès au texte intégral et bibtex

2023

Conference papers

titre: Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits
auteur: Dorian Baudry, Fabien Pesquerel, Rémy Degenne, Odalric-Ambrym Maillard
article: NeurIPS 2023 - Thirty-seventh Conference on Neural Information Processing Systems, Dec 2023, New Orleans (Louisiana), United States
resume: We consider the problem of regret minimization in non-parametric stochastic bandits. When the rewards are known to be bounded from above, there exists asymptotically optimal algorithms, with asymptotic regret depending on an infimum of Kullback-Leibler divergences (KL). These algorithms are computationally expensive and require storing all past rewards, thus simpler but non-optimal algorithms are often used instead. We introduce several methods to approximate the infimum KL which reduce drastically the computational and memory costs of existing optimal algorithms, while keeping their regret guaranties. We apply our findings to design new variants of the MED and IMED algorithms, and demonstrate their interest with extensive numerical simulations.
Accès au texte intégral et bibtex

titre: Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits
auteur: Hassan Saber, Fabien Pesquerel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
article: Asian Conference on Machine Learning, Nov 2023, Istanbul, Turkey
resume: We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a nonasymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
Accès au texte intégral et bibtex

titre: Bregman Deviations of Generic Exponential Families
auteur: Sayak Ray Chowdhury, Patrick Saux, Odalric-Ambrym Maillard, Aditya Gopalan
article: Conference On Learning Theory (COLT), Jul 2023, Bangalore, India
resume: We revisit the method of mixture technique, also known as the Laplace method, to study the concentration phenomenon in generic exponential families. Combining the properties of Bregman divergence associated with log-partition function of the family with the method of mixtures for super-martingales, we establish a generic bound controlling the Bregman divergence between the parameter of the family and a finite sample estimate of the parameter. Our bound is time-uniform and makes appear a quantity extending the classical information gain to exponential families, which we call the Bregman information gain. For the practitioner, we instantiate this novel bound to several classical families, e.g., Gaussian, Bernoulli, Exponential, Weibull, Pareto, Poisson and Chi-square yielding explicit forms of the confidence sets and the Bregman information gain. We further numerically compare the resulting confidence bounds to state-of-the-art alternatives for time-uniform concentration and show that this novel method yields competitive results. Finally, we highlight the benefit of our concentration bounds on some illustrative applications.
Accès au texte intégral et bibtex

titre: Risk-aware linear bandits with convex loss
auteur: Patrick Saux, Odalric-Ambrym Maillard
article: International Conference on Artificial Intelligence and Statistics (AISTATS), Apr 2023, Valencia, Spain
resume: In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-square problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximated solution obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.
Accès au texte intégral et bibtex

titre: Farm-gym: A modular reinforcement learning platform for stochastic agronomic games
auteur: Odalric-Ambrym Maillard, Timothée Mathieu, Debabrota Basu
article: AIAFS 2023 - Artificial Intelligence for Agriculture and Food Systems, Feb 2023, Wahington DC, United States
resume: We introduce Farm-gym, an open-source farming environment written in Python, that models sequential decisionmaking in farms using Reinforcement Learning (RL). Farm-gym conceptualizes a farm as a dynamical system with many interacting entities. Leveraging a modular design, it enables us to instantiate from very simple to highly complicated environments. Contrasting many available gym environments, Farm-gym features intrinsically stochastic games, using stochastic growth models and weather data. Further, it enables to create farm games in a modular way, activating or not the entities (e.g. weeds, pests, pollinators), and yielding non-trivial coupled dynamics. Finally, every game can be customized with .yaml files for rewards, feasible actions, and initial/end-game conditions. We illustrate some interesting features on simple farms. We also showcase the challenges posed by Farm-gym to the deep RL algorithms, in order to stimulate studies in the RL community.
Accès au texte intégral et bibtex

titre: Learning crop management by reinforcement: gym-DSSAT
auteur: Romain Gautron, Emilio J Padrón, Philippe Preux, Julien Bigot, Odalric-Ambrym Maillard, Gerrit Hoogenboom, Julien Teigny
article: AIAFS 2023 - 2nd AAAI Workshop on AI for Agriculture and Food Systems, Feb 2023, Washignton DC, United States
resume: We introduce gym-DSSAT, a gym environment for crop management tasks, that is easy to use for training Reinforcement Learning (RL) agents. gym-DSSAT is based on DSSAT, a state-of-the-art mechanistic crop growth simulator. We modify DSSAT so that an external software agent can interact with it to control the actions performed in a crop field during a growing season. The RL environment provides predefined decision problems without having to manipulate the complex crop simulator. We report encouraging preliminary results on a use case of nitrogen fertilization for maize. This work opens up opportunities to explore new sustainable crop management strategies with RL, and provides RL researchers with an original set of challenging tasks to investigate.
Accès au texte intégral et bibtex

titre: Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration & Planning
auteur: Reda Ouhamma, Debabrota Basu, Odalric-Ambrym Maillard
article: Proceedings of the AAAI Conference on Artificial Intelligence, Feb 2023, Washignton DC, United States. pp.9336-9344, ⟨10.1609/aaai.v37i8.26119⟩
resume: We study the problem of episodic reinforcement learning in continuous stateaction spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects a calibrated Gaussian noise in the parameter of rewards to ensure exploration, and c) leverages linearity of the exponential family with respect to an underlying RKHS to perform tractable planning. We further provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound of Õ( (d^3 H^3 K)^{1/2} ), where d is the dimension of the parameters, H is the episode length, and K is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by √H and removes the handcrafted clipping deployed in existing RLSVI-type algorithms. Our regret bound is order-optimal with respect to H and K.
Accès au texte intégral et bibtex

2022

Journal articles

titre: A channel selection game for multi-operator LoRaWAN deployments
auteur: Kinda Khawam, Hassan Fawaz, Samer Lahoud, Odalric-Ambrym Maillard, Steven Martin
article: Computer Networks, 2022, 216, pp.109185. ⟨10.1016/j.comnet.2022.109185⟩
resume: With over 75 billion Internet of Things (IoT) devices expected worldwide by the year 2025, inaugural MAC layer solutions for long-range IoT deployments no longer suffice. LoRaWAN, the principal technology for comprehensive IoT deployments, enables low power and long range communications. However, synchronous transmissions on the same spreading factors and within the same frequency channel, augmented by the necessary clustering of IoT devices, will cause collisions and drops in efficiency. The performance is further impacted by the shortage of radio resources and by multiple operators utilizing the same unlicensed frequency bands. In this paper, we propose a game theoretic based channel selection algorithm for LoRaWAN in a multi-operator deployment scenario. We begin by proposing an optimal formulation for the selection process with the objective of maximizing the total normalized throughput per spreading factor, per channel. Then, we propose centralized optimal approaches, as well as distributed algorithms, based on reinforcement learning and regret matching dynamics, to finding both the Nash and Correlated equilibria of the proposed game. We simulate our proposals and compare them to the legacy approach of randomized channel access, stressing their efficiency in improving the total normalized throughput as well as the packet delivery ratios.
Accès au bibtex

titre: Reinforcement Learning for crop management
auteur: Romain Gautron, Odalric-Ambrym Maillard, Philippe Preux, Marc Corbeels, Régis Sabbadin
article: Computers and Electronics in Agriculture, 2022, 200, pp.107182. ⟨10.1016/j.compag.2022.107182⟩
resume: Highlights: • Reinforcement learning is a promising AI framework to support crop management. • Reinforcement learning-based crop management support literature is scarce. • A reinforcement learning-based system should learn from interactions on the ground. • Crop management support is related to many reinforcement learning research questions. • Joint research by the reinforcement learning and agronomy communities is required. Abstract: Reinforcement learning (RL), including multi-armed bandits, is a branch of machine learning that deals with the problem of sequential decision-making in uncertain and unknown environments through learning by practice. While best known for being the core of the artificial intelligence (AI) world’s best Go game player, RL has a vast range of potential applications. RL may help to address some of the criticisms leveled against crop management decision support systems (DSS): it is an interactive, geared towards action, contextual tool to evaluate series of crop operations faced with uncertainties. A review of RL use for crop management DSS reveals a limited number of contributions. We profile key prospects for a human-centered, real-world, interactive RL-based system to face tomorrow’s agricultural decisions, and theoretical and ongoing practical challenges that may explain its current low uptake. We argue that a joint research effort from the RL and agronomy communities is necessary to explore RL’s full potential.
Accès au texte intégral et bibtex

titre: Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits
auteur: Lilian Besson, Emilie Kaufmann, Odalric-Ambrym Maillard, Julien Seznec
article: Journal of Machine Learning Research, 2022
resume: We introduce GLR-klUCB, a novel algorithm for the piecewise iid non-stationary bandit problem with bounded rewards. This algorithm combines an efficient bandit algorithm, kl-UCB, with an efficient, parameter-free, changepoint detector, the Bernoulli Generalized Likelihood Ratio Test, for which we provide new theoretical guarantees of independent interest. Unlike previous non-stationary bandit algorithms using a change-point detector, GLR-klUCB does not need to be calibrated based on prior knowledge on the arms' means. We prove that this algorithm can attain a $O(\sqrt{TA \Upsilon_T\log(T)})$ regret in $T$ rounds on some ``easy'' instances, where A is the number of arms and $\Upsilon_T$ the number of change-points, without prior knowledge of $\Upsilon_T$. In contrast with recently proposed algorithms that are agnostic to $\Upsilon_T$, we perform a numerical study showing that GLR-klUCB is also very efficient in practice, beyond easy instances.
Accès au texte intégral et bibtex

titre: Local Dvoretzky-Kiefer-Wolfowitz confidence bands
auteur: Odalric-Ambrym Maillard
article: Mathematical Methods of Statistics, 2022, ⟨10.3103/S1066530721010038⟩
resume: In this paper, we revisit the concentration inequalities for the supremum of the cumulative distribution function (CDF) of a real-valued continuous distribution as established by Dvoretzky, Kiefer, Wolfowitz and revisited later by Massart in in two seminal papers. We focus on the concentration of the local supremum over a sub-interval, rather than on the full domain. That is, denoting U the CDF of the uniform distribution over [0, 1] and U n its empirical version built from n samples, we study P sup u∈[u,u] U n (u)−U (u) > ε for different values of u, u ∈ [0, 1]. Such local controls naturally appear for instance when studying estimation error of spectral risk-measures (such as the conditional value at risk), where [u, u] is typically [0, α] or [1 − α, 1] for a risk level α, after reshaping the CDF F of the considered distribution into U by the general inverse transform F −1. Extending a proof technique from Smirnov, we provide exact expressions of the local quantities P sup u∈[u,u] U n (u) − U (u) > ε and P sup u∈[u,u] U (u) − U n (u) > ε for each n, ε, u, u. Interestingly these quantities, seen as a function of ε, can be easily inverted numerically into functions of the probability level δ. Although not explicit, they can be computed and tabulated. We plot such expressions and compare them to the classical bound log(1/δ) 2n provided by Massart inequality. We then provide an application of such result to the control of generic functional of the CDF, motivated by the case of the conditional value at risk. Last, we extend the local concentration results holding individually for each n to time-uniform concentration inequalities holding simultaneously for all n, revisiting a reflection inequality by James, which is of independent interest for the study of sequential decision making strategies.
Accès au texte intégral et bibtex

titre: Collaborative Algorithms for Online Personalized Mean Estimation
auteur: Mahsa Asadi, Aurélien Bellet, Odalric-Ambrym Maillard, Marc Tommasi
article: Transactions on Machine Learning Research Journal, 2022
resume: We consider an online estimation problem involving a set of agents. Each agent has access to a (personal) process that generates samples from a real-valued distribution and seeks to estimate its mean. We study the case where some of the distributions have the same mean, and the agents are allowed to actively query information from other agents. The goal is to design an algorithm that enables each agent to improve its mean estimate thanks to communication with other agents. The means as well as the number of distributions with same mean are unknown, which makes the task nontrivial. We introduce a novel collaborative strategy to solve this online personalized mean estimation problem. We analyze its time complexity and introduce variants that enjoy good performance in numerical experiments. We also extend our approach to the setting where clusters of agents with similar means seek to estimate the mean of their cluster.
Accès au texte intégral et bibtex

Conference papers

titre: IMED-RL: Regret optimal learning of ergodic Markov decision processes
auteur: Fabien Pesquerel, Odalric-Ambrym Maillard
article: NeurIPS 2022 - Thirty-sixth Conference on Neural Information Processing Systems, Nov 2022, New-Orleans, United States
resume: We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average reward criterion, and focus on the minimization of the regret with respect to an optimal policy, when the learner does not know the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB, KL-UCB or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from prohibitive burn-in phase and computational complexity when implemented in practice. In this work, we shed a novel light on regret minimization strategies, by extending to reinforcement learning the computationally appealing Indexed Minimum Empirical Divergence (IMED) bandit algorithm. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is ergodic. Under this assumption, we introduce IMED-RL and prove that its regret upper bound asymptotically matches the regret lower bound. We discuss both the case when the supports of transitions are unknown, and the more informative but a priori harder-to-exploit-optimally case when they are known. Rewards are assumed light-tailed, semi-bounded from above. Last, we provide numerical illustrations on classical tabular MDPs, ergodic and communicating only, showing the competitiveness of IMED-RL in finite-time against state-of-the-art algorithms. IMED-RL also benefits from a light complexity.
Accès au texte intégral et bibtex

titre: Risk-aware linear bandits with convex loss
auteur: Patrick Saux, Odalric-Ambrym Maillard
article: European Workshop on Reinforcement Learning, Sep 2022, Milan, Italy
resume: In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-square problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximated solution obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.
Accès au texte intégral et bibtex

Poster communications

titre: Petits jeux de données et prédiction en Intelligence Artificielle, vers une meilleure cohabitation : Application à la gestion durable de l'enherbement des systèmes agricoles à La Réunion
auteur: Frédérick Fabre Ferber, Jean Diatta, Jean-Christophe Soulié, Dominique Gay, Odalric-Ambrym Maillard, Thomas Le Bourgeois, Sandrine Auzoux
article: Comité scientifique et technique du DPP CapTerre, Nov 2022, Saint-Leu de La Réunion, La Réunion
resume: Transformer les données pour obtenir une meilleure représentation et ainsi, mieux gérer la gestion de l'enherbement Kernels et transformation d'espace Perspectives Ces travaux de thèse permettront d'améliorer d'une part, la performance prédictive des algorithmes pour une meilleure gestion de l'enherbement des systèmes agricoles et d'autre part, pourront guider à la création de protocoles efficaces de collecte de données agronomiques pour améliorer leur qualité.
Accès au texte intégral et bibtex

Reports

titre: gym-DSSAT: a crop model turned into a Reinforcement Learning environment
auteur: Romain Gautron, Emilio J. Padrón, Philippe Preux, Julien Bigot, Odalric-Ambrym Maillard, David Emukpere
article: [Research Report] RR-9460, Inria Lille. 2022, pp.31
resume: Addressing a real world sequential decision problem with Reinforcement Learning (RL) usually starts with the use of a simulated environment that mimics real conditions. We present a novel open source RL environment for realistic crop management tasks. gym-DSSAT is a gym interface to the Decision Support System for Agrotechnology Transfer (DSSAT), a high fidelity crop simulator. DSSAT has been developped over the last 30 years and is widely recognized by agronomists. gym-DSSAT comes with predefined simulations based on real world maize experiments. The environment is as easy to use as any gym environment. We provide performance baselines using basic RL algorithms. We also briefly outline how the monolithic DSSAT simulator written in Fortran has been turned into a Python RL environment. Our methodology is generic and may be applied to similar simulators. We report on very preliminary experimental results which suggest that RL can help researchers to improve sustainability of fertilization and irrigation practices.
Accès au texte intégral et bibtex

Preprints, Working Papers, ...

titre: Bandits Corrupted by Nature: Lower Bounds on Regret and Robust Optimistic Algorithm
auteur: Debabrota Basu, Odalric-Ambrym Maillard, Timothée Mathieu
article: 2022
resume: In this paper, we study the stochastic bandits problem with k unknown heavy-tailed and corrupted reward distributions or arms with time-invariant corruption distributions. At each iteration, the player chooses an arm. Given the arm, the environment returns an uncorrupted reward with probability 1−ε and an arbitrarily corrupted reward with probability ε. In our setting, the uncorrupted reward might be heavy-tailed and the corrupted reward might be unbounded. We prove a lower bound on the regret indicating that the corrupted and heavy-tailed bandits are strictly harder than uncorrupted or light-tailed bandits. We observe that the environments can be categorised into hardness regimes depending on the suboptimality gap ∆, variance σ, and corruption proportion ϵ. Following this, we design a UCB-type algorithm, namely HuberUCB, that leverages Huber's estimator for robust mean estimation. HuberUCB leads to tight upper bounds on regret in the proposed corrupted and heavy-tailed setting. To derive the upper bound, we prove a novel concentration inequality for Huber's estimator, which might be of independent interest.
Accès au texte intégral et bibtex

2021

Conference papers

titre: Indexed Minimum Empirical Divergence for Unimodal Bandits
auteur: Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
article: NeurIPS 2021 - International Conference on Neural Information Processing Systems, Dec 2021, Virtual-only Conference, United States
resume: We consider a multi-armed bandit problem specified by a set of one-dimensional family exponential distributions endowed with a unimodal structure. We introduce IMED-UB, a algorithm that optimally exploits the unimodal-structure, by adapting to this setting the Indexed Minimum Empirical Divergence (IMED) algorithm introduced by Honda and Takemura [2015]. Owing to our proof technique, we are able to provide a concise finite-time analysis of IMED-UB algorithm. Numerical experiments show that IMED-UB competes with the state-of-the-art algorithms.
Accès au texte intégral et bibtex

titre: From Optimality to Robustness: Dirichlet Sampling Strategies in Stochastic Bandits
auteur: Dorian Baudry, Patrick Saux, Odalric-Ambrym Maillard
article: NeurIPS 2021 - 35th International Conference on Neural Information Processing Systems, Dec 2021, Sydney, Australia
resume: The stochastic multi-arm bandit problem has been extensively studied under standard assumptions on the arm's distribution (e.g bounded with known support, exponential family, etc). These assumptions are suitable for many real-world problems but sometimes they require knowledge (on tails for instance) that may not be precisely accessible to the practitioner, raising the question of the robustness of bandit algorithms to model misspecification. In this paper we study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed with re-sampling of the arms' observations and a data-dependent exploration bonus. We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded and logarithmic regret for semi-bounded distributions with a mild quantile condition. We also show that a simple tuning achieve robustness with respect to a large class of unbounded distributions, at the cost of slightly worse than logarithmic asymptotic regret. We finally provide numerical experiments showing the merits of DS in a decision-making problem on synthetic agriculture data.
Accès au texte intégral et bibtex

titre: Stochastic bandits with groups of similar arms
auteur: Fabien Pesquerel, Hassan Saber, Odalric-Ambrym Maillard
article: NeurIPS 2021 - Thirty-fifth Conference on Neural Information Processing Systems, Dec 2021, Sydney, Australia
resume: We consider a variant of the stochastic multi-armed bandit problem where arms are known to be organized into different groups having the same mean. The groups are unknown but a lower bound q on their size is known. This situation typically appears when each arm can be described with a list of categorical attributes, and the (unknown) mean reward function only depends on a subset of them, the others being redundant. In this case, q is linked naturally to the number of attributes considered redundant, and the number of categories of each attribute. For this structured problem of practical relevance, we first derive the asymptotic regret lower bound and corresponding constrained optimization problem. They reveal the achievable regret can be substantially reduced when compared to the unstructured setup, possibly by a factor q. However, solving exactly the exact constrained optimization problem involves a combinatorial problem. We introduce a lowerbound inspired strategy involving a computationally efficient relaxation that is based on a sorting mechanism. We further prove it achieves a lower bound close to the optimal one up to a controlled factor, and achieves an asymptotic regret q times smaller than the unstructured one. We believe this shows it is a valuable strategy for the practitioner. Last, we illustrate the performance of the considered strategy on numerical experiments involving a large number of arms.
Accès au texte intégral et bibtex

titre: Routine Bandits: Minimizing Regret on Recurring Problems
auteur: Hassan Saber, Léo Saci, Odalric-Ambrym Maillard, Audrey Durand
article: ECML-PKDD 2021, Sep 2021, Bilbao, Spain
resume: We study a variant of the multi-armed bandit problem in which a learner faces every day one of B many bandit instances, and call it a routine bandit. More specifically, at each period h ∈ [1, H] , the same bandit b^h is considered during T > 1 consecutive time steps, but the identity b^h is unknown to the learner. We assume all rewards distribution are Gaussian standard. Such a situation typically occurs in recommender systems when a learner may repeatedly serve the same user whose identity is unknown due to privacy issues. By combining banditidentification tests with a KLUCB type strategy, we introduce the KLUCB for Routine Bandits (KLUCB-RB) algorithm. While independently running KLUCB algorithm at each period leads to a cumulative expected regret of Ω(H log T) after H many periods when T → ∞, KLUCB-RB benefits from previous periods by aggregating observations from similar identified bandits, which yields a non-trivial scaling of Ω(log T). This is achieved without knowing which bandit instance is being faced by KLUCB-RB on this period, nor knowing a priori the number of possible bandit instances. We provide numerical illustration that confirm the benefit of KLUCB-RB while using less information about the problem compared with existing strategies for similar problems.
Accès au texte intégral et bibtex

titre: Optimal Thompson Sampling strategies for support-aware CVaR bandits
auteur: Dorian Baudry, Romain Gautron, Emilie Kaufmann, Odalric-Ambrym Maillard
article: 38th International Conference on Machine Learning, Jul 2021, Virtual, United States
resume: In this paper we study a multi-arm bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level alpha of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits on bounded rewards that is flexible enough to solve a variety of problems grounded on physical resources. Building on a recent work by Riou & Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that enables to theoretically bound their CVaR regret minimization performance. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we illustrate empirically the benefit of Thompson Sampling approaches both in a realistic environment simulating a use-case in agriculture and on various synthetic examples.
Accès au texte intégral et bibtex

titre: Learning Value Functions in Deep Policy Gradients using Residual Variance
auteur: Yannis Flet-Berliac, Reda Ouhamma, Odalric-Ambrym Maillard, Philippe Preux
article: ICLR 2021 - International Conference on Learning Representations, May 2021, Vienna / Virtual, Austria
resume: Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights.
Accès au texte intégral et bibtex

titre: Improved Exploration in Factored Average-Reward MDPs
auteur: Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard
article: 24th International Conference on Artificial Intelligence and Statistics, 2021, San diego (virtual), United States
resume: We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the stateaction space X and the state-space S admit the respective factored forms of X = ⊗ n i=1 X i and S = ⊗ m i=1 S i , and the transition and reward functions are factored over X and S. Assuming known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound, whose leading term strictly improves over existing regret bounds in terms of the dependencies on the size of S i 's and the involved diameter-related terms. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of regret of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL enjoys a substantially improved regret empirically over existing algorithms that have frequentist regret guarantees.
Accès au texte intégral et bibtex

titre: Reinforcement Learning in Parametric MDPs with Exponential Families
auteur: Sayak Ray Chowdhury, Aditya Gopalan, Odalric-Ambrym Maillard
article: International Conference on Artificial Intelligence and Statistics, 2021, San diego, United States. pp.1855-1863
Accès au texte intégral et bibtex

2020

Conference papers

titre: Robust-Adaptive Interval Predictive Control for Linear Uncertain Systems
auteur: Edouard Leurent, Denis Efimov, Odalric-Ambrym Maillard
article: CDC 2020 - 59th IEEE Conference on Decision and Control, Dec 2020, Jeju Island / Virtual, South Korea
resume: We consider the problem of stabilization of a linear system, under state and control constraints, and subject to bounded disturbances and unknown parameters in the state matrix. First, using a simple least square solution and available noisy measurements, the set of admissible values for parameters is evaluated. Second, for the estimated set of parameter values and the corresponding linear interval model of the system, two interval predictors are recalled and an unconstrained stabilizing control is designed that uses the predicted intervals. Third, to guarantee the robust constraint satisfaction, a model predictive control algorithm is developed, which is based on solution of an optimization problem posed for the interval predictor. The conditions for recursive feasibility and asymptotic performance are established. Efficiency of the proposed control framework is illustrated by numeric simulations.
Accès au texte intégral et bibtex

titre: Sub-sampling for Efficient Non-Parametric Bandit Exploration
auteur: Dorian Baudry, Emilie Kaufmann, Odalric-Ambrym Maillard
article: NeurIPS 2020, Dec 2020, Vancouver, Canada
resume: In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling which requires to specify a different prior to be optimal in each case, our proposal RB-SDA does not need any distribution-dependent tuning. RB-SDA belongs to the family of Sub-sampling Duelling Algorithms (SDA) which combines the sub-sampling idea first used by the BESA [1] and SSMC [2] algorithms with different sub-sampling schemes. In particular, RB-SDA uses Random Block sampling. We perform an experimental study assessing the flexibility and robustness of this promising novel approach for exploration in bandit models.
Accès au texte intégral et bibtex

titre: Robust-Adaptive Control of Linear Systems: beyond Quadratic Costs
auteur: Edouard Leurent, Denis Efimov, Odalric-Ambrym Maillard
article: NeurIPS 2020 - 34th Conference on Neural Information Processing Systems, Dec 2020, Vancouver / Virtual, Canada
resume: We consider the problem of robust and adaptive model predictive control (MPC) of a linear system, with unknown parameters that are learned along the way (adaptive), in a critical setting where failures must be prevented (robust). This problem has been studied from different perspectives by different communities. However, the existing theory deals only with the case of quadratic costs (the LQ problem), which limits applications to stabilisation and tracking tasks only. In order to handle more general (non-convex) costs that naturally arise in many practical problems, we carefully select and bring together several tools from different communities, namely non-asymptotic linear regression, recent results in interval prediction, and tree-based planning. Combining and adapting the theoretical guarantees at each layer is non trivial, and we provide the first end-to-end suboptimality analysis for this setting. Interestingly, our analysis naturally adapts to handle many models and combines with a data-driven robust model selection strategy, which enables to relax the modelling assumptions. Last, we strive to preserve tractability at any stage of the method, that we illustrate on two challenging simulated environments.
Accès au texte intégral et bibtex

titre: Monte-Carlo Graph Search: the Value of Merging Similar States
auteur: Edouard Leurent, Odalric-Ambrym Maillard
article: ACML 2020 - 12th Asian Conference on Machine Learning, Nov 2020, Bangkok / Virtual, Thailand. pp.577 - 602
resume: We consider the problem of planning in a Markov Decision Process (MDP) with a generative model and limited computational budget. Despite the underlying MDP transitions having a graph structure, the popular Monte-Carlo Tree Search algorithms such as UCT rely on a tree structure to represent their value estimates. That is, they do not identify together two similar states reached via different trajectories and represented in separate branches of the tree. In this work, we propose a graph-based planning algorithm, which takes into account this state similarity. In our analysis, we provide a regret bound that depends on a novel problem-dependent measure of difficulty, which improves on the original tree-based bound in MDPs where the trajectories overlap, and recovers it otherwise. Then, we show that this methodology can be adapted to existing planning algorithms that deal with stochastic systems. Finally, numerical simulations illustrate the benefits of our approach.
Accès au texte intégral et bibtex

titre: Tightening Exploration in Upper Confidence Reinforcement Learning
auteur: Hippolyte Bourel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
article: International Conference on Machine Learning, Jul 2020, Vienna, Austria
resume: The upper confidence reinforcement learning (UCRL2) algorithm introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees , this algorithm and its variants have remained until now mostly theoretical as numerical experiments in simple environments exhibit long burn-in phases before the learning takes place. In pursuit of practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair. Furthermore , to tighten exploration, it uses an adap-tive computation of the support of each transition distribution, which in turn enables us to revisit the extended value iteration procedure of UCRL2 to optimize over distributions with reduced support by disregarding low probability transitions, while still ensuring near-optimism. We demonstrate , through numerical experiments in standard environments, that reducing exploration this way yields a substantial numerical improvement compared to UCRL2 and its variants. On the theoretical side, these key modifications enable us to derive a regret bound for UCRL3 improving on UCRL2, that for the first time makes appear notions of local diameter and local effective support, thanks to variance-aware concentration bounds.
Accès au texte intégral et bibtex

titre: Restarted Bayesian Online Change-point Detector achieves Optimal Detection Delay
auteur: Réda Alami, Odalric-Ambrym Maillard, Raphael Féraud
article: International Conference on Machine Learning, Jul 2020, Wien, Austria
resume: In this paper, we consider the problem of sequential change-point detection where both the changepoints and the distributions before and after the change are assumed to be unknown. For this problem of primary importance in statistical and sequential learning theory, we derive a variant of the Bayesian Online Change Point Detector proposed by (Fearnhead & Liu, 2007) which is easier to analyze than the original version while keeping its powerful message-passing algorithm. We provide a non-asymptotic analysis of the false-alarm rate and the detection delay that matches the existing lower-bound. We further provide the first explicit high-probability control of the detection delay for such approach. Experiments on synthetic and realworld data show that this proposal outperforms the state-of-art change-point detection strategy, namely the Improved Generalized Likelihood Ratio (Improved GLR) while compares favorably with the original Bayesian Online Change Point Detection strategy.
Accès au texte intégral et bibtex

Preprints, Working Papers, ...

titre: Optimal Strategies for Graph-Structured Bandits
auteur: Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
article: 2020
resume: We study a structured variant of the multi-armed bandit problem specified by a set of Bernoulli distributions $ \nu \!= \!(\nu_{a,b})_{a \in \mathcal{A}, b \in \mathcal{B}}$ with means $(\mu_{a,b})_{a \in \mathcal{A}, b \in \mathcal{B}}\!\in\![0,1]^{\mathcal{A}\times\mathcal{B}}$ and by a given weight matrix $\omega\!=\! (\omega_{b,b'})_{b,b' \in \mathcal{B}}$, where $ \mathcal{A}$ is a finite set of arms and $ \mathcal{B} $ is a finite set of users. The weight matrix $\omega$ is such that for any two users $b,b'\!\in\!\mathcal{B}, \text{max}_{a\in\mathcal{A}}|\mu_{a,b} \!-\! \mu_{a,b'}| \!\leq\! \omega_{b,b'} $. This formulation is flexible enough to capture various situations, from highly-structured scenarios ($\omega\!\in\!\{0,1\}^{\mathcal{B}\times\mathcal{B}}$) to fully unstructured setups ($\omega\!\equiv\! 1$). We consider two scenarios depending on whether the learner chooses only the actions to sample rewards from or both users and actions. We first derive problem-dependent lower bounds on the regret for this generic graph-structure that involves a structure dependent linear programming problem. Second, we adapt to this setting the Indexed Minimum Empirical Divergence (IMED) algorithm introduced by Honda and Takemura (2015), and introduce the IMED-GS$^\star$ algorithm. Interestingly, IMED-GS$^\star$ does not require computing the solution of the linear programming problem more than about $\log(T)$ times after $T$ steps, while being provably asymptotically optimal. Also, unlike existing bandit strategies designed for other popular structures, IMED-GS$^\star$ does not resort to an explicit forced exploration scheme and only makes use of local counts of empirical events. We finally provide numerical illustration of our results that confirm the performance of IMED-GS$^\star$.
Accès au texte intégral et bibtex

titre: Forced-exploration free Strategies for Unimodal Bandits
auteur: Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
article: 2020
resume: We consider a multi-armed bandit problem specified by a set of Gaussian or Bernoulli distributions endowed with a unimodal structure. Although this problem has been addressed in the literature (Combes and Proutiere, 2014), the state-of-the-art algorithms for such structure make appear a forced-exploration mechanism. We introduce IMED-UB, the first forced-exploration free strategy that exploits the unimodal-structure, by adapting to this setting the Indexed Minimum Empirical Divergence (IMED) strategy introduced by Honda and Takemura (2015). This strategy is proven optimal. We then derive KLUCB-UB, a KLUCB version of IMED-UB, which is also proven optimal. Owing to our proof technique, we are further able to provide a concise finite-time analysis of both strategies in an unified way. Numerical experiments show that both IMED-UB and KLUCB-UB perform similarly in practice and outperform the state-of-the-art algorithms.
Accès au texte intégral et bibtex

2019

Conference papers

titre: Budgeted Reinforcement Learning in Continuous State Space
auteur: Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin
article: Conference on Neural Information Processing Systems, Dec 2019, Vancouver, Canada
resume: A Budgeted Markov Decision Process (BMDP) is an extension of a Markov Decision Process to critical applications requiring safety constraints. It relies on a notion of risk implemented in the shape of a cost signal constrained to lie below an-adjustable-threshold. So far, BMDPs could only be solved in the case of finite state spaces with known dynamics. This work extends the state-of-the-art to continuous spaces environments and unknown dynamics. We show that the solution to a BMDP is a fixed point of a novel Budgeted Bellman Optimality operator. This observation allows us to introduce natural extensions of Deep Reinforcement Learning algorithms to address large-scale BMDPs. We validate our approach on two simulated applications: spoken dialogue and autonomous driving.
Accès au texte intégral et bibtex

titre: Learning Multiple Markov Chains via Adaptive Allocation
auteur: Mohammad Sadegh Talebi, Odalric-Ambrym Maillard
article: Advances in Neural Information Processing Systems 32 (NIPS 2019), Dec 2019, Vancouver, Canada
resume: We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain. We assume that the Markov chains are ergodic but otherwise unknown. The learner can sample Markov chains sequentially to observe their states. The goal of the learner is to sequentially select various chains to learn transition matrices uniformly well with respect to some loss function. We introduce a notion of loss that naturally extends the squared loss for learning distributions to the case of Markov chains, and further characterize the notion of being uniformly good in all problem instances. We present a novel learning algorithm that efficiently balances exploration and exploitation intrinsic to this problem, without any prior knowledge of the chains. We provide finite-sample PAC-type guarantees on the performance of the algorithm. Further, we show that our algorithm asymptotically attains an optimal loss.
Accès au texte intégral et bibtex

titre: Regret Bounds for Learning State Representations in Reinforcement Learning
auteur: Ronald Ortner, Matteo Pirotta, Ronan Fruit, Alessandro Lazaric, Odalric-Ambrym Maillard
article: Conference on Neural Information Processing Systems, Dec 2019, Vancouver, Canada
resume: We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent. At least one of these representations is assumed to induce a Markov decision process (MDP), and the performance of the agent is measured in terms of cumulative regret against the optimal policy giving the highest average reward in this MDP representation. We propose an algorithm (UCB-MS) with O(√ T) regret in any communicating MDP. The regret bound shows that UCB-MS automatically adapts to the Markov model and improves over the currently known best bound of order O(T 2/3).
Accès au texte intégral et bibtex

titre: Model-Based Reinforcement Learning Exploiting State-Action Equivalence
auteur: Mahsa Asadi, Mohammad Sadegh Talebi, Hippolyte Bourel, Odalric-Ambrym Maillard
article: ACML 2019, Proceedings of Machine Learning Research, Nov 2019, Nagoya, Japan. pp.204 - 219
resume: Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, as a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of regret by a factor of SA/C, in any communicating MDP with S states, A actions, and C classes, which corresponds to a massive improvement when C SA. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
Accès au texte intégral et bibtex

titre: Practical Open-Loop Optimistic Planning
auteur: Edouard Leurent, Odalric-Ambrym Maillard
article: European Conference on Machine Learning, Sep 2019, Würzburg, Germany
resume: We consider the problem of online planning in a Markov Decision Process when given only access to a generative model, restricted to open-loop policies-i.e. sequences of actions-and under budget constraint. In this setting, the Open-Loop Optimistic Planning (OLOP) algorithm enjoys good theoretical guarantees but is overly conservative in practice, as we show in numerical experiments. We propose a modified version of the algorithm with tighter upper-confidence bounds, KL-OLOP, that leads to better practical performances while retaining the sample complexity bound. Finally, we propose an efficient implementation that significantly improves the time complexity of both algorithms.
Accès au texte intégral et bibtex

titre: Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds
auteur: Odalric-Ambrym Maillard
article: Algorithmic Learning Theory, 2019, Chicago, United States. pp.1 - 23
resume: We consider change-point detection in a fully sequential setup, when observations are received one by one and one must raise an alarm as early as possible after any change. We assume that both the change points and the distributions before and after the change are unknown. We consider the class of piecewise-constant mean processes with sub-Gaussian noise, and we target a detection strategy that is uniformly good on this class (this constrains the false alarm rate and detection delay). We introduce a novel tuning of the GLR test that takes here a simple form involving scan statistics, based on a novel sharp concentration inequality using an extension of the Laplace method for scan-statistics that holds doubly-uniformly in time. This also considerably simplifies the implementation of the test and analysis. We provide (perhaps surprisingly) the first fully non-asymptotic analysis of the detection delay of this test that matches the known existing asymptotic orders, with fully explicit numerical constants. Then, we extend this analysis to allow some changes that are not-detectable by any uniformly-good strategy (the number of observations before and after the change are too small for it to be detected by any such algorithm), and provide the first robust, finite-time analysis of the detection delay.
Accès au texte intégral et bibtex

Habilitation à diriger des recherches

titre: Mathematics of Statistical Sequential Decision Making
auteur: Odalric-Ambrym Maillard
article: Mathematics [math]. Université de Lille, Sciences et Technologies, 2019
resume: In this document, we give an overview of recent contributions to the mathematics of statistical sequential learning. Unlike research articles that start from a motivating example and provide little room to the mathematical tools in the main body of the article, we here give primary focus to these tools, in order to stress their potential as well as their role in the development of improved algorithms and proof techniques in the field. We revisit in particular properties of the log Laplace transform of a random variable, the handling of random stopping time in concentration of measure of empirical distributions, and we highlight the fundamental role of the “change of measure” argument both in the construction of performance lower-bounds as well as near-optimal strategies. We then give focus to obtaining finite-time error guarantees on the parameter estimation in parametric models before highlighting the strength of Legendre-Fenchel duality in the design of risk-averse and robust strategies. Finally, we turn the setting of Markov decision processes where we present some key insights for the development of the next generation of decision strategies. We end this manuscript by providing a more focused presentation of three key contributions in bandit theory, stochastic automata, and aggregation of experts
Accès au texte intégral et bibtex

Preprints, Working Papers, ...

titre: Active Roll-outs in MDP with Irreversible Dynamics
auteur: Odalric-Ambrym Maillard, Timothy Mann, Ronald Ortner, Shie Mannor
article: 2019
resume: In Reinforcement Learning (RL), regret guarantees scaling with the square root of the time horizon have been shown to hold only for communicating Markov decision processes (MDPs) where any two states are connected. This essentially means that an algorithm can eventually recover from any mistake. However, real-world tasks usually include situations where taking a single "bad" action can permanently trap a learner in a suboptimal region of the state-space. Since it is provably impossible to achieve sub-linear regret in general multi-chain MDPs, we assume a weak mechanism that allows the learner to request additional information. Our main contribution is to address: (i) how much external information is needed, (ii) how and when to use it, and (iii) how much regret is incurred. We design an algorithm that minimizes requests for external information in the form of rollouts of a policy specified by the learner by actively requesting it only when needed. The algorithm provably achieves O(√ T) active regret after T steps in a large class of multi-chain MDPs, by only requesting O(log(T)) rollout transitions. The superiority of our algorithm to standard algorithms such as R-Max and UCRL is demonstrated in experiments on some illustrative grid-world examples. (a) (b) (c) Figure 1: Example of (a) a communicating MDP, (b) a unichain MDP with a single recurrent class, and (c) a multi-chain MDP with two recurrent classes. The circles represent states while the labeled edges represent transitions due to executing actions {a, b, c}.
Accès au texte intégral et bibtex

2018

Journal articles

titre: Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs
auteur: Mohammad Sadegh Talebi, Odalric-Ambrym Maillard
article: Journal of Machine Learning Research, 2018, pp.1-36
resume: The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-Ucrl algorithm establishing a high-probability regret bound scaling as O S s,a V s,a T for this algorithm for ergodic MDPs, where S denotes the number of states and where V s,a is the variance of the bias function with respect to the next-state distribution following action a in state s. The resulting bound improves upon the best previously known regret bound O(DS √ AT) for that algorithm, where A and D respectively denote the maximum number of actions (per state) and the diameter of MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, that we believe to be of independent interest.
Accès au texte intégral et bibtex

titre: Streaming kernel regression with provably adaptive mean, variance, and regularization
auteur: Audrey Durand, Odalric-Ambrym Maillard, Joelle Pineau
article: Journal of Machine Learning Research, 2018, 1, pp.1 - 48
resume: We consider the problem of streaming kernel regression, when the observations arrive sequentially and the goal is to recover the underlying mean function, assumed to belong to an RKHS. The variance of the noise is not assumed to be known. In this context, we tackle the problem of tuning the regularization parameter adaptively at each time step, while maintaining tight confidence bounds estimates on the value of the mean function at each point. To this end, we first generalize existing results for finite-dimensional linear regression with fixed regularization and known variance to the kernel setup with a regularization parameter allowed to be a measurable function of past observations. Then, using appropriate self-normalized inequalities we build upper and lower bound estimates for the variance, leading to Bersntein-like concentration bounds. The later is used in order to define the adaptive regularization. The bounds resulting from our technique are valid uniformly over all observation points and all time steps, and are compared against the literature with numerical experiments. Finally, the potential of these tools is illustrated by an application to kernelized bandits, where we revisit the Kernel UCB and Kernel Thompson Sampling procedures, and show the benefits of the novel adaptive kernel tuning strategy.
Accès au texte intégral et bibtex

titre: Boundary Crossing Probabilities for General Exponential Families
auteur: Odalric-Ambrym Maillard
article: Mathematical Methods of Statistics, 2018, 27, pp.1-31. ⟨10.3103/S1066530718010015⟩
Accès au texte intégral et bibtex

Conference papers

titre: Approximate Robust Control of Uncertain Dynamical Systems
auteur: Edouard Leurent, Yann Blanco, Denis Efimov, Odalric-Ambrym Maillard
article: Proc. MLITS Workshop at NeurIPS, Dec 2018, Montreal, Canada
resume: This work studies the design of safe control policies for large-scale non-linear systems operating in uncertain environments. In such a case, the robust control framework is a principled approach to safety that aims to maximize the worst-case performance of a system. However, the resulting optimization problem is generally intractable for non-linear systems with continuous states. To overcome this issue, we introduce two tractable methods that are based either on sampling or on a conservative approximation of the robust objective. The proposed approaches are applied to the problem of autonomous driving.
Accès au texte intégral et bibtex

Poster communications

titre: Memory Bandits: Towards the Switching Bandit Problem Best Resolution
auteur: Réda Alami, Odalric-Ambrym Maillard, Raphaël Féraud
article: MLSS 2018 - Machine Learning Summer School, Aug 2018, Madrid, Spain
Accès au texte intégral et bibtex

Preprints, Working Papers, ...

titre: Upper Confidence Reinforcement Learning exploiting state-action equivalence
auteur: Odalric-Ambrym Maillard, Mahsa Asadi
article: 2018
resume: Leveraging an equivalence property on the set of states of state-action pairs in an Markov Decision Process (MDP) has been suggested by many authors. We take the study of equivalence classes to the reinforcement learning (RL) setup, when transition distributions are no longer assumed to be known, in a discrete MDP with average reward criterion and no reset. We study powerful similarities between state-action pairs related to optimal transport. We first analyze a variant of the UCRL2 algorithm called C-UCRL2, which highlights the clear benefit of leveraging this equivalence structure when it is known ahead of time: the regret bound scales as ~O(D√KCT) where C is the number of classes of equivalent state-action pairs and K bounds the size of the support of the transitions. A non trivial question is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learn the unknown classes, but show that its natural combination with UCRL2 empirically fails. Our findings suggests this is due to the ad-hoc criterion for stopping the episodes in UCRL2. We replace it with hypothesis testing, which in turns considerably improves all strategies. It is then empirically validated that learning the structure can be beneficial in a full-blown RL problem.
Accès au texte intégral et bibtex

2017

Journal articles

titre: The Non-stationary Stochastic Multi-armed Bandit Problem
auteur: Robin Allesiardo, Raphaël Féraud, Odalric-Ambrym Maillard
article: International Journal of Data Science and Analytics, 2017, 3 (4), pp.267-283. ⟨10.1007/s41060-017-0050-5⟩
resume: We consider a variant of the stochastic multi-armed bandit with K arms where the rewards are not assumed to be identically distributed, but are generated by a non-stationary stochastic process. We first study the unique best arm setting when there exists one unique best arm. Second, we study the general switching best arm setting when a best arm switches at some unknown steps. For both settings, we target problem-dependent bounds, instead of the more conservative problem-free bounds. We consider two classical problems: (1) identify a best arm with high probability (best arm identification), for which the performance measure by the sample complexity (number of samples before finding a near-optimal arm). To this end, we naturally extend the definition of sample complexity so that it makes sense in the switching best arm setting, which may be of independent interest. (2) Achieve the smallest cumulative regret (regret minimization) where the regret is measured with respect to the strategy pulling an arm with the best instantaneous mean at each step.
Accès au texte intégral et bibtex

Conference papers

titre: Efficient tracking of a growing number of experts
auteur: Jaouad Mourtada, Odalric-Ambrym Maillard
article: Algorithmic Learning Theory, Oct 2017, Tokyo, Japan. pp.1 - 23
resume: We consider a variation on the problem of prediction with expert advice, where new forecasters that were unknown until then may appear at each round. As often in prediction with expert advice, designing an algorithm that achieves near-optimal regret guarantees is straightforward, using ag-gregation of experts. However, when the comparison class is sufficiently rich, for instance when the best expert and the set of experts itself changes over time, such strategies naively require to maintain a prohibitive number of weights (typically exponential with the time horizon). By contrast, designing strategies that both achieve a near-optimal regret and maintain a reasonable number of weights is highly non-trivial. We consider three increasingly challenging objectives (simple regret, shifting regret and sparse shifting regret) that extend existing notions defined for a fixed expert ensemble; in each case, we design strategies that achieve tight regret bounds, adaptive to the parameters of the comparison class, while being computationally inexpensive. Moreover, our algorithms are anytime , agnostic to the number of incoming experts and completely parameter-free. Such remarkable results are made possible thanks to two simple but highly effective recipes: first the " abstention trick " that comes from the specialist framework and enables to handle the least challenging notions of regret, but is limited when addressing more sophisticated objectives. Second, the " muting trick " that we introduce to give more flexibility. We show how to combine these two tricks in order to handle the most challenging class of comparison strategies.
Accès au texte intégral et bibtex

titre: Boundary Crossing for General Exponential Families
auteur: Odalric-Ambrym Maillard
article: Algorithmic Learning Theory, Oct 2017, Kyoto, Japan. pp.1 - 34
resume: We consider parametric exponential families of dimension K on the real line. We study a variant of boundary crossing probabilities coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension K. Formally, our result is a concentration inequality that bounds the probability that B ψ (θ n , θ) f (t/n)/n, where θ is the parameter of an unknown target distribution, θ n is the empirical parameter estimate built from n observations, ψ is the log-partition function of the exponential family and B ψ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function f is logarithmic, as it enables to analyze the regret of the state-of-the-art KL-ucb and KL-ucb+ strategies, whose analysis was left open in such generality. Indeed, previous results only hold for the case when K = 1, while we provide results for arbitrary finite dimension K, thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques to achieve these strong results already existed three decades ago in the work of T.L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques that we believe are useful beyond the application to stochastic multi-armed bandits.
Accès au texte intégral et bibtex

titre: Spectral Learning from a Single Trajectory under Finite-State Policies
auteur: Borja Balle, Odalric-Ambrym Maillard
article: International conference on Machine Learning, Jul 2017, Sidney, France
resume: We present spectral methods of moments for learning sequential models from a single trajec-tory, in stark contrast with the classical literature that assumes the availability of multiple i.i.d. trajectories. Our approach leverages an efficient SVD-based learning algorithm for weighted au-tomata and provides the first rigorous analysis for learning many important models using dependent data. We state and analyze the algorithm under three increasingly difficult scenarios: proba-bilistic automata, stochastic weighted automata, and reactive predictive state representations controlled by a finite-state policy. Our proofs include novel tools for studying mixing properties of stochastic weighted automata.
Accès au texte intégral et bibtex

Lectures

titre: Basic Concentration Properties of Real-Valued Distributions
auteur: Odalric-Ambrym Maillard
article: Doctoral. France. 2017
resume: In this note we introduce and discuss a few concentration tools for the study of concentration inequalities on the real line. After recalling versions of the Chernoff method, we move to concentration inequalities for predictable processes. We especially focus on bounds that enable to handle the sum of real-valued random variables, where the number of summands is itself a random stopping time, and target fully explicit and empirical bounds. We then discuss some important other tools, such as the Laplace method and the transportation lemma.
Accès au texte intégral et bibtex

2016

Conference papers

titre: Pliable rejection sampling
auteur: Akram Erraqabi, Michal Valko, Alexandra Carpentier, Odalric-Ambrym Maillard
article: International Conference on Machine Learning, Jun 2016, New York City, United States
resume: Rejection sampling is a technique for sampling from difficult distributions. However, its use is limited due to a high rejection rate. Common adaptive rejection sampling methods either work only for very specific distributions or without performance guarantees. In this paper, we present pliable rejection sampling (PRS), a new approach to rejection sampling, where we learn the sampling proposal using a kernel estimator. Since our method builds on rejection sampling, the samples obtained are with high probability i.i.d. and distributed according to f. Moreover, PRS comes with a guarantee on the number of accepted samples.
Accès au texte intégral et bibtex

Preprints, Working Papers, ...

titre: Random Shuffling and Resets for the Non-stationary Stochastic Bandit Problem
auteur: Robin Allesiardo, Raphaël Féraud, Odalric-Ambrym Maillard
article: 2016
resume: We consider a non-stationary formulation of the stochastic multi-armed bandit where the rewards are no longer assumed to be identically distributed. For the best-arm identification task, we introduce a version of SUCCESSIVE ELIMINATION based on random shuffling of the K arms. We prove that under a novel and mild assumption on the mean gap ∆, this simple but powerful modification achieves the same guarantees in term of sample complexity and cumulative regret than its original version, but in a much wider class of problems, as it is not anymore constrained to stationary distributions. We also show that the original SUCCESSIVE ELIMINATION fails to have controlled regret in this more general scenario, thus showing the benefit of shuffling. We then remove our mild assumption and adapt the algorithm to the best-arm identification task with switching arms. We adapt the definition of the sample complexity for that case and prove that, against an optimal policy with N − 1 switches of the optimal arm, this new algorithm achieves an expected sample complexity of O(∆^{−2} sqrt(N Kdelta^{−1} log(K/delta)), where δ is the probability of failure of the algorithm, and an expected cumulative regret of O(∆^{−1} sqrt(N T K log(T K))) after T time steps.
Accès au texte intégral et bibtex

titre: Low-rank Bandits with Latent Mixtures
auteur: Aditya Gopalan, Odalric-Ambrym Maillard, Mohammadi Zaki
article: 2016
resume: We study the task of maximizing rewards from recommending items (actions) to users sequentially interacting with a recommender system. Users are modeled as latent mixtures of C many representative user classes, where each class specifies a mean reward profile across actions. Both the user features (mixture distribution over classes) and the item features (mean reward vector per class) are unknown a priori. The user identity is the only contextual information available to the learner while interacting. This induces a low-rank structure on the matrix of expected rewards r a,b from recommending item a to user b. The problem reduces to the well-known linear bandit when either user-or item-side features are perfectly known. In the setting where each user, with its stochastically sampled taste profile, interacts only for a small number of sessions, we develop a bandit algorithm for the two-sided uncertainty. It combines the Robust Tensor Power Method of Anandkumar et al. (2014b) with the OFUL linear bandit algorithm of Abbasi-Yadkori et al. (2011). We provide the first rigorous regret analysis of this combination, showing that its regret after T user interactions is˜O(C √ BT), with B the number of users. An ingredient towards this result is a novel robustness property of OFUL, of independent interest.
Accès au texte intégral et bibtex

titre: Self-normalization techniques for streaming confident regression
auteur: Odalric-Ambrym Maillard
article: 2016
resume: We consider, in a generic streaming regression setting, the problem of building a confidence interval (and distribution) on the next observation based on past observed data. The observations given to the learner are of the form (x, y) with y = f (x) + ξ, where x can have arbitrary dependency on the past observations, f is unknown and the noise ξ is sub-Gaussian conditionally on the past observations. Further, the observations are assumed to come from some external filtering process making the number of observations itself a random stopping time. In this challenging scenario that captures a large class of processes with non-anticipative dependencies, we study the ordinary, ridge, and kernel least-squares estimates and provide confidence intervals based on self-normalized vector-valued martingale techniques, applied to the estimation of the mean and of the variance. We then discuss how these adaptive confidence intervals can be used in order to detect a possible model mismatch as well as to estimate the future (self-information, quadratic, or transportation) loss of the learner at a next step.
Accès au texte intégral et bibtex

2015

Journal articles

titre: Concentration inequalities for sampling without replacement
auteur: Rémi Bardenet, Odalric-Ambrym Maillard
article: Bernoulli, 2015, 21 (3), pp.1361-1385. ⟨10.3150/14-BEJ605⟩
resume: Concentration inequalities quantify the deviation of a random variable from a fixed value. In spite of numerous applications, such as opinion surveys or ecological counting procedures , few concentration results are known for the setting of sampling without replacement from a finite population. Until now, the best general concentration inequality has been a Hoeffding inequality due to ?. In this paper, we first improve on the fundamental result of ?, and further extend it to obtain a Bernstein concentration bound for sampling without replacement. We then derive an empirical version of our bound that does not require the variance to be known to the user.
Accès au texte intégral et bibtex

Preprints, Working Papers, ...

titre: A note on replacing uniform subsampling by random projections in MCMC for linear regression of tall datasets
auteur: Rémi Bardenet, Odalric-Ambrym Maillard
article: 2015
resume: New Markov chain Monte Carlo (MCMC) methods have been proposed to tackle inference with tall datasets, i.e., when the number n of data items is intractably large. A large class of these new MCMC methods is based on randomly subsampling the dataset at each MCMC iteration. We investigate whether random projections can replace this random subsampling for linear regression of big streaming data. In the latter setting, random projections have indeed become standard for non-Bayesian treatments. We isolate two issues for MCMC to apply to streaming regression: 1) a resampling issue; MCMC should access the same random projections across iterations to avoid keeping the whole dataset in memory and 2) a budget issue; making individual MCMC acceptance decisions should require o(n) random projections. While the resampling issue can be satisfyingly tackled, current techniques in random projections and MCMC for tall data do not solve the budget issue, and may well end up showing it is not possible.
Accès au texte intégral et bibtex

2014

Journal articles

titre: Sub-sampling for Multi-armed Bandits
auteur: Akram Baransi, Odalric-Ambrym Maillard, Shie Mannor
article: Proceedings of the European Conference on Machine Learning, 2014, pp.13
resume: The stochastic multi-armed bandit problem is a popular model of the exploration/exploitation trade-off in sequential decision problems. We introduce a novel algorithm that is based on sub-sampling. Despite its simplicity, we show that the algorithm demonstrates excellent empirical performances against state-of-the-art algorithms, including Thompson sampling and KL-UCB. The algorithm is very flexible, it does need to know a set of reward distributions in advance nor the range of the rewards. It is not restricted to Bernoulli distributions and is also invariant under rescaling of the rewards. We provide a detailed experimental study comparing the algorithm to the state of the art, the main intuition that explains the striking results, and conclude with a finite-time regret analysis for this algorithm in the simplified two-arm bandit setting.
Accès au texte intégral et bibtex

titre: Latent Bandits
auteur: Odalric-Ambrym Maillard, Shie Mannor
article: JFPDA, 2014, pp.05
resume: We consider a multi-armed bandit problem where the reward distributions are indexed by two sets -one for arms, one for type- and can be partitioned into a small number of clusters according to the type. First, we consider the setting where all reward distributions are known and all types have the same underlying cluster, the type's identity is, however, unknown. Second, we study the case where types may come from different classes, which is significantly more challenging. Finally, we tackle the case where the reward distributions are completely unknown. In each setting, we introduce specific algorithms and derive non-trivial regret performance. Numerical experiments show that, in the most challenging agnostic case, the proposed algorithm achieves excellent performance in difficult scenarios.
Accès au texte intégral et bibtex

Conference papers

titre: Selecting Near-Optimal Approximate State Representations in Reinforcement Learning
auteur: Ronald Ortner, Odalric-Ambrym Maillard, Daniil Ryabko
article: International Conference on Algorithmic Learning Theory (ALT), Oct 2014, Bled, Slovenia. pp.140-154
resume: We consider a reinforcement learning setting where the learner does not have explicit access to the states of the underlying Markov decision process (MDP). Instead, she has access to several models that map histories of past interactions to states. Here we improve over known regret bounds in this setting, and more importantly generalize to the case where the models given to the learner do not contain a true model resulting in an MDP representation but only approximations of it. We also give improved error bounds for state aggregation.
Accès au bibtex

Other publications

titre: Latent Bandits.
auteur: Odalric-Ambrym Maillard, Shie Mannor
article: 2014
resume: We consider a multi-armed bandit problem where the reward distributions are indexed by two sets --one for arms, one for type-- and can be partitioned into a small number of clusters according to the type. First, we consider the setting where all reward distributions are known and all types have the same underlying cluster, the type's identity is, however, unknown. Second, we study the case where types may come from different classes, which is significantly more challenging. Finally, we tackle the case where the reward distributions are completely unknown. In each setting, we introduce specific algorithms and derive non-trivial regret performance. Numerical experiments show that, in the most challenging agnostic case, the proposed algorithm achieves excellent performance in several difficult scenarios.
Accès au texte intégral et bibtex

2013

Journal articles

titre: Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation
auteur: Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz
article: Annals of Statistics, 2013, 41 (3), pp.1516-1541. ⟨10.1214/13-AOS1119⟩
resume: We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: The kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.
Accès au texte intégral et bibtex

Conference papers

titre: Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning
auteur: Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko
article: ICML - 30th International Conference on Machine Learning, 2013, Atlanta, USA, United States. pp.543-551
resume: We consider an agent interacting with an environment in a single stream of actions, observations, and rewards, with no reset. This process is not assumed to be a Markov Decision Process (MDP). Rather, the agent has several representations (mapping histories of past interactions to a discrete state space) of the environment with unknown dynamics, only some of which result in an MDP. The goal is to minimize the average regret criterion against an agent who knows an MDP representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order $O(T^{2/3})$ with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after $T$ time steps is $O(\sqrt{T})$, with all constants reasonably small. This is optimal in $T$ since $O(\sqrt{T})$ is the optimal regret in the setting of learning in a (single discrete) MDP.
Accès au texte intégral et bibtex

titre: Competing with an Infinite Set of Models in Reinforcement Learning
auteur: Phuong Nguyen, Odalric-Ambrym Maillard, Daniil Ryabko, Ronald Ortner
article: AISTATS, 2013, Arizona, United States. pp.463-471
resume: We consider a reinforcement learning setting where the learner also has to deal with the problem of finding a suitable state-representation function from a given set of models. This has to be done while interacting with the environment in an online fashion (no resets), and the goal is to have small regret with respect to any Markov model in the set. For this setting, recently the \BLB~algorithm has been proposed, which achieves regret of order $T^{2/3}$, provided that the given set of models is finite. Our first contribution is to extend this result to a countably infinite set of models. Moreover, the \BLB~regret bound suffers from an additive term that can be exponential in the diameter of the MDP involved, since the diameter has to be guessed. The algorithm we propose avoids guessing the diameter, thus improving the regret bound.
Accès au bibtex

Other publications

titre: Robust Risk-averse Stochastic Multi-Armed Bandits
auteur: Odalric-Ambrym Maillard
article: 2013
resume: We study a variant of the standard stochastic multi-armed bandit problem when one is not interested in the arm with the best mean, but instead in the arm maximizing some coherent risk measure criterion. Further, we are studying the deviations of the regret instead of the less informative expected regret. We provide an algorithm, called RA-UCB to solve this problem, together with a high probability bound on its regret.
Accès au texte intégral et bibtex

2012

Preprints, Working Papers, ...

titre: Hierarchical Optimistic Region Selection driven by Curiosity
auteur: Odalric-Ambrym Maillard
article: 2012
resume: This paper aims to take a step forwards making the term ''intrinsic motivation'' from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. To that end, we consider the setting where, a fixed partition \P of a continuous space \X being given, and a process \nu defined on \X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample \nu in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis.
Accès au texte intégral et bibtex

titre: Online allocation and homogeneous partitioning for piecewise constant mean-approximation
auteur: Odalric-Ambrym Maillard, Alexandra Carpentier
article: 2012
resume: In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X \subset \Real^d. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of \alpha-Hölder functions.
Accès au texte intégral et bibtex

2011

Conference papers

titre: Selecting the State-Representation in Reinforcement Learning
auteur: Odalric-Ambrym Maillard, Rémi Munos, Daniil Ryabko
article: Neural Information Processing Systems, Dec 2011, Granada, Spain
Accès au bibtex

titre: A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences
auteur: Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz
article: 24th Annual Conference on Learning Theory : COLT'11, Jul 2011, Budapest, Hungary. pp.18
resume: We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of \cite{Burnetas96}. Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms).
Accès au texte intégral et bibtex

Reports

titre: Adaptive Bandits: Towards the best history-dependent strategy
auteur: Odalric-Ambrym Maillard, Rémi Munos
article: [Technical Report] 2011, pp.14
resume: We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which dene two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes dened by some model theta*\in Theta. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Theta. This allows to model opponents (case 1) or strategies (case 2) which handles nite memory, periodicity, standard stochastic bandits and other situations. When Theta={theta}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by ~O(sqrt(TAC)), where C is the number of classes of theta. Now, when many models are available, all known algorithms achieving a nice regret O(sqrt(T)) are unfortunately not tractable and scale poorly with the number of models |Theta|. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3}C^{1/3} log(|Theta|)^{1/2}.
Accès au texte intégral et bibtex

Theses

titre: APPRENTISSAGE SÉQUENTIEL : Bandits, Statistique et Renforcement.
auteur: Odalric-Ambrym Maillard
article: Machine Learning [cs.LG]. Université des Sciences et Technologie de Lille - Lille I, 2011. English. ⟨NNT : ⟩
resume: This thesis studies the following topics in Machine Learning: Bandit theory, Statistical learning and Reinforcement learning. The common underlying thread is the non-asymptotic study of various notions of adaptation: to an environment or an opponent in part I about bandit theory, to the structure of a signal in part II about statistical theory, to the structure of states and rewards or to some state-model of the world in part III about reinforcement learning. First we derive a non-asymptotic analysis of a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit that enables to match, in the case of distributions with finite support, the asymptotic distribution-dependent lower bound known for this problem. Now for a multi-armed bandit with a possibly adaptive opponent, we introduce history-based models to catch some weakness of the opponent, and show how one can benefit from such models to design algorithms adaptive to this weakness. Then we contribute to the regression setting and show how the use of random matrices can be beneficial both theoretically and numerically when the considered hypothesis space has a large, possibly infinite, dimension. We also use random matrices in the sparse recovery setting to build sensing operators that allow for recovery when the basis is far from being orthogonal. Finally we combine part I and II to first provide a non-asymptotic analysis of reinforcement learning algorithms such as Bellman-residual minimization and a version of Leastsquares temporal-difference that uses random projections and then, upstream of the Markov Decision Problem setting, discuss the practical problem of choosing a good model of states.
Accès au texte intégral et bibtex

2010

Conference papers

titre: Finite-Sample Analysis of Bellman Residual Minimization
auteur: Odalric-Ambrym Maillard, Rémi Munos, Alessandro Lazaric, Mohammad Ghavamzadeh
article: Asian Conference on Machine Learning, 2010, Japan
resume: We consider the Bellman residual minimization approach for solving discounted Markov decision problems, where we assume that a generative model of the dynamics and rewards is available. At each policy iteration step, an approximation of the value function for the current policy is obtained by minimizing an empirical Bellman residual defined on a set of n states drawn i.i.d. from a distribution, the immediate rewards, and the next states sampled from the model. Our main result is a generalization bound for the Bellman residual in linear approximation spaces. In particular, we prove that the empirical Bellman residual approaches the true (quadratic) Bellman residual with a rate of order O(1/sqrt((n)). This result implies that minimizing the empirical residual is indeed a sound approach for the minimization of the true Bellman residual which guarantees a good approximation of the value function for each policy. Finally, we derive performance bounds for the resulting approximate policy iteration algorithm in terms of the number of samples n and a measure of how well the function space is able to approximate the sequence of value functions.)
Accès au texte intégral et bibtex

Reports

titre: Linear regression with random projections
auteur: Odalric-Ambrym Maillard, Rémi Munos
article: [Technical Report] 2010, pp.22
resume: We consider ordinary (non penalized) least-squares regression where the regression function is chosen in a randomly generated sub-space GP \subset S of finite dimension P, where S is a function space of infinite dimension, e.g. L2([0, 1]^d). GP is defined as the span of P random features that are linear combinations of the basis functions of S weighted by random Gaussian i.i.d. coefficients. We characterize the so-called kernel space K \subset S of the resulting Gaussian process and derive approximation error bounds of order O(||f||^2_K log(P)/P) for functions f \in K approximated in GP . We apply this result to derive excess risk bounds for the least-squares estimate in various spaces. For illustration, we consider regression using the so-called scrambled wavelets (i.e. random linear combinations of wavelets of L2([0, 1]^d)) and derive an excess risk rate O(||f*||_K(logN)/sqrt(N)) which is arbitrarily close to the minimax optimal rate (up to a logarithmic factor) for target functions f* in K = H^s([0, 1]^d), a Sobolev space of smoothness order s > d/2. We describe an efficient implementation using lazy expansions with numerical complexity ˜O(2dN^3/2 logN+N^2), where d is the dimension of the input data and N is the number of data.
Accès au texte intégral et bibtex

titre: Brownian Motions and Scrambled Wavelets for Least-Squares Regression
auteur: Odalric-Ambrym Maillard, Rémi Munos
article: [Technical Report] 2010, pp.13
resume: We consider ordinary (non penalized) least-squares regression where the regression function is chosen in a randomly generated sub-space GP \subset S of finite dimension P, where S is a function space of infinite dimension, e.g. L2([0, 1]^d). GP is defined as the span of P random features that are linear combinations of the basis functions of S weighted by random Gaussian i.i.d. coefficients. We characterize the so-called kernel space K \subset S of the resulting Gaussian process and derive approximation error bounds of order O(||f||^2_K log(P)/P) for functions f \in K approximated in GP . We apply this result to derive excess risk bounds for the least-squares estimate in various spaces. For illustration, we consider regression using the so-called scrambled wavelets (i.e. random linear combinations of wavelets of L2([0, 1]^d)) and derive an excess risk rate O(||f*||_K(logN)/sqrt(N)) which is arbitrarily close to the minimax optimal rate (up to a logarithmic factor) for target functions f* in K = H^s([0, 1]^d), a Sobolev space of smoothness order s > d/2. We describe an efficient implementation using lazy expansions with numerical complexity ˜O(2dN^3/2 logN+N^5/2), where d is the dimension of the input data and N is the number of data.
Accès au texte intégral et bibtex

2009

Conference papers

titre: Compressed Least-Squares Regression
auteur: Odalric-Ambrym Maillard, Rémi Munos
article: NIPS 2009, Dec 2009, Vancouver, Canada
resume: We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting ``Compressed Least Squares Regression'' (CLSR) in terms of N, K, and M. When we choose M=O(\sqrt{K}), we show that CLSR has an estimation error of order O(\log K / \sqrt{K}).
Accès au texte intégral et bibtex