Not enough data to create a plot.
Try a different view from the menu above.
Sugiyama, Masashi
The Survival Bandit Problem
Riou, Charles, Honda, Junya, Sugiyama, Masashi
We introduce and study a new variant of the multi-armed bandit problem (MAB), called the survival bandit problem (S-MAB). While in both problems, the objective is to maximize the so-called cumulative reward, in this new variant, the procedure is interrupted if the cumulative reward falls below a preset threshold. This simple yet unexplored extension of the MAB follows from many practical applications. For example, when testing two medicines against each other on voluntary patients, people's health are at stake, and it is necessary to be able to interrupt experiments if serious side effects occur or if the disease syndromes are not dissipated by the treatment. From a theoretical perspective, the S-MAB is the first variant of the MAB where the procedure may or may not be interrupted. We start by formalizing the S-MAB and we define its objective as the minimization of the so-called survival regret, which naturally generalizes the regret of the MAB. Then, we show that the objective of the S-MAB is considerably more difficult than the MAB, in the sense that contrary to the MAB, no policy can achieve a reasonably small (i.e., sublinear) survival regret. Instead, we minimize the survival regret in the sense of Pareto, i.e., we seek a policy whose cumulative reward cannot be improved for some problem instance without being sacrificed for another one. For that purpose, we identify two key components in the survival regret: the regret given no ruin (which corresponds to the regret in the MAB), and the probability that the procedure is interrupted, called the probability of ruin. We derive a lower bound on the probability of ruin, as well as policies whose probability of ruin matches the lower bound. Finally, based on a doubling trick on those policies, we derive a policy which minimizes the survival regret in the sense of Pareto, giving an answer to an open problem by Perotto et al. (COLT 2019).
An Optimal Algorithm for the Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit
Nakamura, Shintaro, Sugiyama, Masashi
We study the real-valued combinatorial pure exploration problem in the stochastic multi-armed bandit (R-CPE-MAB). We study the case where the size of the action set is polynomial with respect to the number of arms. In such a case, the R-CPE-MAB can be seen as a special case of the so-called transductive linear bandits. Existing methods in the R-CPE-MAB and transductive linear bandits have a gap of problem-dependent constant terms and logarithmic terms between the upper and lower bounds of the sample complexity, respectively. We close these gaps by proposing an algorithm named the combinatorial gap-based exploration (CombGapE) algorithm, whose sample complexity upper bound matches the lower bound. Finally, we numerically show that the CombGapE algorithm outperforms existing methods significantly.
The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models
Lee, Jongyeong, Chiang, Chao-Kai, Sugiyama, Masashi
Thompson sampling (TS) has been known for its outstanding empirical performance supported by theoretical guarantees across various reward models in the classical stochastic multi-armed bandit problems. Nonetheless, its optimality is often restricted to specific priors due to the common observation that TS is fairly insensitive to the choice of the prior when it comes to asymptotic regret bounds. However, when the model contains multiple parameters, the optimality of TS highly depends on the choice of priors, which casts doubt on the generalizability of previous findings to other models. To address this gap, this study explores the impact of selecting noninformative priors, offering insights into the performance of TS when dealing with new models that lack theoretical understanding. We first extend the regret analysis of TS to the model of uniform distributions with unknown supports, which would be the simplest non-regular model. Our findings reveal that changing noninformative priors can significantly affect the expected regret, aligning with previously known results in other multiparameter bandit models. Although the uniform prior is shown to be optimal, we highlight the inherent limitation of its optimality, which is limited to specific parameterizations and emphasizes the significance of the invariance property of priors. In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which can achieve the asymptotic optimality for the Gaussian models and the uniform models by using the reference prior and the Jeffreys prior that are invariant under one-to-one reparameterizations. This policy provides an alternative approach to achieving optimality by employing fine-tuned truncation, which would be much easier than hunting for optimal priors in practice.
Learning with Complementary Labels Revisited: A Consistent Approach via Negative-Unlabeled Learning
Wang, Wei, Ishida, Takashi, Zhang, Yu-Jie, Niu, Gang, Sugiyama, Masashi
Deep learning and its applications have achieved great success in recent years. However, to achieve good performance, large amounts of training data with accurate labels are required, which may not be satisfied in some real-world scenarios. Due to the effectiveness in reducing the cost and effort of labeling while maintaining comparable performance, various weakly supervised learning problems have been investigated in recent years, including semi-supervised learning [Berthelot et al., 2019], noisy-label learning [Patrini et al., 2017], programmatic weak supervision [Zhang et al., 2021a], positive-unlabeled learning [Bekker and Davis, 2020], similarity-based classification [Hsu et al., 2019], and partial-label learning [Wang et al., 2022]. Complementary-label learning is another weakly supervised learning problem that has received a lot of attention recently [Ishida et al., 2017]. In complementary-label learning, we are given training data associated with complementary labels that specify the classes to which the examples do not belong. The task is to learn a multi-class classifier that assigns correct labels to ordinary-label testing data. Collecting training data with complementary labels is much easier and cheaper than collecting ordinary-label data. For example, when asking workers on crowdsourcing platforms to annotate training data, we only need to randomly select a candidate label and then ask them whether the example belongs to that class or not.
Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit
Nakamura, Shintaro, Sugiyama, Masashi
We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown distribution with mean $\mu_s$. In each time step, a player pulls a single arm and observes its reward. The player's goal is to identify the optimal \emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in \mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite-sized real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few arm pulls as possible. Previous methods in the R-CPE-MAB assume that the size of the action set $\mathcal{A}$ is polynomial in $d$. We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$. We also introduce a novel problem-dependent sample complexity lower bound of the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor.
Fixed-Budget Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit
Nakamura, Shintaro, Sugiyama, Masashi
We study the real-valued combinatorial pure exploration of the multi-armed bandit in the fixed-budget setting. We first introduce the Combinatorial Successive Asign (CSA) algorithm, which is the first algorithm that can identify the best action even when the size of the action class is exponentially large with respect to the number of arms. We show that the upper bound of the probability of error of the CSA algorithm matches a lower bound up to a logarithmic factor in the exponent. Then, we introduce another algorithm named the Minimax Combinatorial Successive Accepts and Rejects (Minimax-CombSAR) algorithm for the case where the size of the action class is polynomial, and show that it is optimal, which matches a lower bound. Finally, we experimentally compare the algorithms with previous methods and show that our algorithm performs better.
Generalizing Importance Weighting to A Universal Solver for Distribution Shift Problems
Fang, Tongtong, Lu, Nan, Niu, Gang, Sugiyama, Masashi
Distribution shift (DS) may have two levels: the distribution itself changes, and the support (i.e., the set where the probability density is non-zero) also changes. When considering the support change between the training and test distributions, there can be four cases: (i) they exactly match; (ii) the training support is wider (and thus covers the test support); (iii) the test support is wider; (iv) they partially overlap. Existing methods are good at cases (i) and (ii), while cases (iii) and (iv) are more common nowadays but still under-explored. In this paper, we generalize importance weighting (IW), a golden solver for cases (i) and (ii), to a universal solver for all cases. Specifically, we first investigate why IW might fail in cases (iii) and (iv); based on the findings, we propose generalized IW (GIW) that could handle cases (iii) and (iv) and would reduce to IW in cases (i) and (ii). In GIW, the test support is split into an in-training (IT) part and an out-of-training (OOT) part, and the expected risk is decomposed into a weighted classification term over the IT part and a standard classification term over the OOT part, which guarantees the risk consistency of GIW. Then, the implementation of GIW consists of three components: (a) the split of validation data is carried out by the one-class support vector machine, (b) the first term of the empirical risk can be handled by any IW algorithm given training data and IT validation data, and (c) the second term just involves OOT validation data. Experiments demonstrate that GIW is a universal solver for DS problems, outperforming IW methods in cases (iii) and (iv).
Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation
Zhu, Jianing, Yu, Geng, Yao, Jiangchao, Liu, Tongliang, Niu, Gang, Sugiyama, Masashi, Han, Bo
Out-of-distribution (OOD) detection is important for deploying reliable machine learning models on real-world applications. Recent advances in outlier exposure have shown promising results on OOD detection via fine-tuning model with informatively sampled auxiliary outliers. However, previous methods assume that the collected outliers can be sufficiently large and representative to cover the boundary between ID and OOD data, which might be impractical and challenging. In this work, we propose a novel framework, namely, Diversified Outlier Exposure (DivOE), for effective OOD detection via informative extrapolation based on the given auxiliary outliers. Specifically, DivOE introduces a new learning objective, which diversifies the auxiliary distribution by explicitly synthesizing more informative outliers for extrapolation during training. It leverages a multi-step optimization method to generate novel outliers beyond the original ones, which is compatible with many variants of outlier exposure. Extensive experiments and analyses have been conducted to characterize and demonstrate the effectiveness of the proposed DivOE.
Enhancing Adversarial Contrastive Learning via Adversarial Invariant Regularization
Xu, Xilie, Zhang, Jingfeng, Liu, Feng, Sugiyama, Masashi, Kankanhalli, Mohan
Adversarial contrastive learning (ACL) is a technique that enhances standard contrastive learning (SCL) by incorporating adversarial data to learn a robust representation that can withstand adversarial attacks and common corruptions without requiring costly annotations. To improve transferability, the existing work introduced the standard invariant regularization (SIR) to impose style-independence property to SCL, which can exempt the impact of nuisance style factors in the standard representation. However, it is unclear how the style-independence property benefits ACL-learned robust representations. In this paper, we leverage the technique of causal reasoning to interpret the ACL and propose adversarial invariant regularization (AIR) to enforce independence from style factors. We regulate the ACL using both SIR and AIR to output the robust representation. Theoretically, we show that AIR implicitly encourages the representational distance between different views of natural data and their adversarial variants to be independent of style factors. Empirically, our experimental results show that invariant regularization significantly improves the performance of state-of-the-art ACL methods in terms of both standard generalization and robustness on downstream tasks. To the best of our knowledge, we are the first to apply causal reasoning to interpret ACL and develop AIR for enhancing ACL-learned robust representations.
Binary Classification with Confidence Difference
Wang, Wei, Feng, Lei, Jiang, Yuchen, Niu, Gang, Zhang, Min-Ling, Sugiyama, Masashi
Recently, learning with soft labels has been shown to achieve better performance than learning with hard labels in terms of model generalization, calibration, and robustness. However, collecting pointwise labeling confidence for all training examples can be challenging and time-consuming in real-world scenarios. This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification. Instead of pointwise labeling confidence, we are given only unlabeled data pairs with confidence difference that specifies the difference in the probabilities of being positive. We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate. We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven. Extensive experiments on benchmark data sets and a real-world recommender system data set validate the effectiveness of our proposed approaches in exploiting the supervision information of the confidence difference.