Asia
Theoretical and Experimental Analyses of Tensor-Based Regression and Classification
Wimalawarne, Kishan, Tomioka, Ryota, Sugiyama, Masashi
We theoretically and experimentally investigate tensor-based regression and classification. Our focus is regularization with various tensor norms, including the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We first give dual optimization methods using the alternating direction method of multipliers, which is computationally efficient when the number of training samples is moderate. We then theoretically derive an excess risk bound for each tensor norm and clarify their behavior. Finally, we perform extensive experiments using simulated and real data and demonstrate the superiority of tensor-based learning methods over vector- and matrix-based learning methods.
Regret vs. Communication: Distributed Stochastic Multi-Armed Bandits and Beyond
Liu, Shuang, Chen, Cheng, Zhang, Zhihua
In this paper, we consider the distributed stochastic multi-armed bandit problem, where a global arm set can be accessed by multiple players independently. The players are allowed to exchange their history of observations with each other at specific points in time. We study the relationship between regret and communication. When the time horizon is known, we propose the Over-Exploration strategy, which only requires one-round communication and whose regret does not scale with the number of players. When the time horizon is unknown, we measure the frequency of communication through a new notion called the density of the communication set, and give an exact characterization of the interplay between regret and communication. Specifically, a lower bound is established and stable strategies that match the lower bound are developed. The results and analyses in this paper are specific but can be translated into more general settings.
Fast rates in statistical and online learning
van Erven, Tim, Grünwald, Peter D., Mehta, Nishant A., Reid, Mark D., Williamson, Robert C.
The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning --- a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition, that comes in two forms: the central condition for 'proper' learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of which have played a central role in obtaining fast rates in statistical learning. Yet, while the Bernstein condition is two-sided, the central condition is one-sided, making it more suitable to deal with unbounded losses. In its stochastic mixability form, our condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying conditions thus provide a substantial step towards a characterization of fast rates in statistical learning, similar to how classical mixability characterizes constant regret in the sequential prediction with expert advice setting.
Dictionary Learning for Blind One Bit Compressed Sensing
Zayyani, Hadi, Korki, Mehdi, Marvasti, Farrokh
This letter proposes a dictionary learning algorithm for blind one bit compressed sensing. In the blind one bit compressed sensing framework, the original signal to be reconstructed from one bit linear random measurements is sparse in an unknown domain. In this context, the multiplication of measurement matrix $\Ab$ and sparse domain matrix $\Phi$, \ie $\Db=\Ab\Phi$, should be learned. Hence, we use dictionary learning to train this matrix. Towards that end, an appropriate continuous convex cost function is suggested for one bit compressed sensing and a simple steepest-descent method is exploited to learn the rows of the matrix $\Db$. Experimental results show the effectiveness of the proposed algorithm against the case of no dictionary learning, specially with increasing the number of training signals and the number of sign measurements.
A Review of Nonnegative Matrix Factorization Methods for Clustering
Clustering, the problem of partitioning observations with high intragroup similarity, has always been one of the central themes in unsupervised learning. Although a wide variety of algorithms for clustering applications have been studied in literature, the subject still remains an active avenue for research. In fact, it is possible to formulate clustering as a matrix decomposition problem. This formulation leads to interesting interpretations as well as novel algorithms that benefit from favorable computational properties of numerical linear algebra. Popularized by Lee and Seung [11], Nonnegative Matrix Factorization (NMF) has turned into one of the primary tools for decomposing data sets into low-rank factorizing matrices in order to yield a parts-based representation.
Nested Hierarchical Dirichlet Processes for Multi-Level Non-Parametric Admixture Modeling
Tekumalla, Lavanya Sita, Agrawal, Priyanka, Bhattacharya, Indrajit
Dirichlet Process(DP) is a Bayesian non-parametric prior for infinite mixture modeling, where the number of mixture components grows with the number of data items. The Hierarchical Dirichlet Process (HDP), is an extension of DP for grouped data, often used for non-parametric topic modeling, where each group is a mixture over shared mixture densities. The Nested Dirichlet Process (nDP), on the other hand, is an extension of the DP for learning group level distributions from data, simultaneously clustering the groups. It allows group level distributions to be shared across groups in a non-parametric setting, leading to a non-parametric mixture of mixtures. The nCRF extends the nDP for multilevel non-parametric mixture modeling, enabling modeling topic hierarchies. However, the nDP and nCRF do not allow sharing of distributions as required in many applications, motivating the need for multi-level non-parametric admixture modeling. We address this gap by proposing multi-level nested HDPs (nHDP) where the base distribution of the HDP is itself a HDP at each level thereby leading to admixtures of admixtures at each level. Because of couplings between various HDP levels, scaling up is naturally a challenge during inference. We propose a multi-level nested Chinese Restaurant Franchise (nCRF) representation for the nested HDP, with which we outline an inference algorithm based on Gibbs Sampling. We evaluate our model with the two level nHDP for non-parametric entity topic modeling where an inner HDP creates a countably infinite set of topic mixtures and associates them with author entities, while an outer HDP associates documents with these author entities. In our experiments on two real world research corpora, the nHDP is able to generalize significantly better than existing models and detect missing author entities with a reasonable level of accuracy.
Inferring Passenger Type from Commuter Eigentravel Matrices
Legara, Erika Fille, Monterola, Christopher
A sufficient knowledge of the demographics of a commuting public is essential in formulating and implementing more targeted transportation policies, as commuters exhibit different ways of traveling. With the advent of the Automated Fare Collection system (AFC), probing the travel patterns of commuters has become less invasive and more accessible. Consequently, numerous transport studies related to human mobility have shown that these observed patterns allow one to pair individuals with locations and/or activities at certain times of the day. However, classifying commuters using their travel signatures is yet to be thoroughly examined. Here, we contribute to the literature by demonstrating a procedure to characterize passenger types (Adult, Child/Student, and Senior Citizen) based on their three-month travel patterns taken from a smart fare card system. We first establish a method to construct distinct commuter matrices, which we refer to as eigentravel matrices, that capture the characteristic travel routines of individuals. From the eigentravel matrices, we build classification models that predict the type of passengers traveling. Among the models explored, the gradient boosting method (GBM) gives the best prediction accuracy at 76%, which is 84% better than the minimum model accuracy (41%) required vis-\`a-vis the proportional chance criterion. In addition, we find that travel features generated during weekdays have greater predictive power than those on weekends. This work should not only be useful for transport planners, but for market researchers as well. With the awareness of which commuter types are traveling, ads, service announcements, and surveys, among others, can be made more targeted spatiotemporally. Finally, our framework should be effective in creating synthetic populations for use in real-world simulations that involve a metropolitan's public transport system.
Convex Calibration Dimension for Multiclass Loss Matrices
Ramaswamy, Harish G., Agarwal, Shivani
We study consistency properties of surrogate loss functions for general multiclass learning problems, defined by a general multiclass loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be calibrated with respect to a loss matrix in this setting. We then introduce the notion of convex calibration dimension of a multiclass loss matrix, which measures the smallest'size' of a prediction space in which it is possible to design a convex surrogate that is calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, we apply our framework to study various subset ranking losses, and use the convex calibration dimension as a tool to show both the existence and nonexistence of various types of convex calibrated surrogates for these losses. Our results strengthen recent results of Duchi et al. (2010) and Calauzènes et al. (2012) on the nonexistence of certain types of convex calibrated surrogates in subset ranking. We anticipate the convex calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. Keywords: Statistical consistency, multiclass loss, loss matrix, surrogate loss, convex surrogates, calibrated surrogates, classification calibration, subset ranking.
Robust Subspace Clustering via Thresholding
Heckel, Reinhard, Bölcskei, Helmut
The problem of clustering noisy and incompletely observed high-dimensional data points into a union of low-dimensional subspaces and a set of outliers is considered. The number of subspaces, their dimensions, and their orientations are assumed unknown. We propose a simple low-complexity subspace clustering algorithm, which applies spectral clustering to an adjacency matrix obtained by thresholding the correlations between data points. In other words, the adjacency matrix is constructed from the nearest neighbors of each data point in spherical distance. A statistical performance analysis shows that the algorithm exhibits robustness to additive noise and succeeds even when the subspaces intersect. Specifically, our results reveal an explicit tradeoff between the affinity of the subspaces and the tolerable noise level. We furthermore prove that the algorithm succeeds even when the data points are incompletely observed with the number of missing entries allowed to be (up to a log-factor) linear in the ambient dimension. We also propose a simple scheme that provably detects outliers, and we present numerical results on real and synthetic data.
Mining Brain Networks using Multiple Side Views for Neurological Disorder Identification
Cao, Bokai, Kong, Xiangnan, Zhang, Jingyuan, Yu, Philip S., Ragin, Ann B.
Mining discriminative subgraph patterns from graph data has attracted great interest in recent years. It has a wide variety of applications in disease diagnosis, neuroimaging, etc. Most research on subgraph mining focuses on the graph representation alone. However, in many real-world applications, the side information is available along with the graph data. For example, for neurological disorder identification, in addition to the brain networks derived from neuroimaging data, hundreds of clinical, immunologic, serologic and cognitive measures may also be documented for each subject. These measures compose multiple side views encoding a tremendous amount of supplemental information for diagnostic purposes, yet are often ignored. In this paper, we study the problem of discriminative subgraph selection using multiple side views and propose a novel solution to find an optimal set of subgraph features for graph classification by exploring a plurality of side views. We derive a feature evaluation criterion, named gSide, to estimate the usefulness of subgraph patterns based upon side views. Then we develop a branch-and-bound algorithm, called gMSV, to efficiently search for optimal subgraph features by integrating the subgraph mining process and the procedure of discriminative feature selection. Empirical studies on graph classification tasks for neurological disorders using brain networks demonstrate that subgraph patterns selected by the multi-side-view guided subgraph selection approach can effectively boost graph classification performances and are relevant to disease diagnosis.