Statistical Learning
Achieving greater Explanatory Power and Forecasting Accuracy with Non-uniform spread Fuzzy Linear Regression
Fuzzy regression models have been applied to several Operations Research applications viz., forecasting and prediction. Earlier works on fuzzy regression analysis obtain crisp regression coefficients for eliminating the problem of increasing spreads for the estimated fuzzy responses as the magnitude of the independent variable increases. But they cannot deal with the problem of non-uniform spreads. In this work, a three-phase approach is discussed to construct the fuzzy regression model with non-uniform spreads to deal with this problem. The first phase constructs the membership functions of the least-squares estimates of regression coefficients based on extension principle to completely conserve the fuzziness of observations. They are then defuzzified by the centre of area method to obtain crisp regression coefficients in the second phase. Finally, the error terms of the method are determined by setting each estimated spread equal to its corresponding observed spread. The Tagaki-Sugeno inference system is used for improving the accuracy of forecasts. The simulation example demonstrates the strength of fuzzy linear regression model in terms of higher explanatory power and forecasting performance.
Active Stratified Sampling with Clustering-Based Type Systems for Predicting the Search Tree Size of Problems with Real-Valued Heuristics
Lelis, Levi H. S. (University of Alberta)
In this paper we advance the line of research launched by Knuth which was later improved by Chen for predicting the size of the search tree expanded by heuristic search algorithms such as IDA*. Chen's Stratified Sampling (SS) uses a partition of the nodes in the search tree called type system to guide its sampling. Recent work has shown that SS using type systems based on integer-valued heuristic functions can be quite effective. However, type systems based on real-valued heuristic functions are often too large to be practical. We use the k-means clustering algorithm for creating effective type systems for domains with real-valued heuristics. Orthogonal to the type systems, another contribution of this paper is the introduction of an algorithm called Active SS. SS allocates the same number of samples for each type. Active SS is the application of the idea of active sampling to search trees. Active SS allocates more samples to the types with higher uncertainty. Our empirical results show that (i) SS using clustering-based type systems tends to produce better predictions than competing schemes that do not use a type system, and that (ii) Active SS can produce better predictions than the regular version of SS.
Parallelising the k-Medoids Clustering Problem Using Space-Partitioning
Arbelaez, Alejandro (JFLI / University of Tokyo) | Quesada, Luis (University College Cork)
The k-medoids problem is a combinatorial optimisation problem with multiples applications in Resource Allocation, Mobile Computing, Sensor Networks and Telecommunications.Real instances of this problem involve hundreds of thousands of points and thousands of medoids.Despite the proliferation of parallel architectures, this problem has been mostly tackled using sequential approaches.In this paper, we study the impact of space-partitioning techniques on the performance of parallel local search algorithms to tackle the k-medoids clustering problem, and compare these results with the ones obtained using sampling.Our experiments suggest that approaches relying on partitioning scale more while preserving the quality of the solution.
Personality Traits Recognition on Social Network - Facebook
Alam, Firoj (University of Trento) | Stepanov, Evgeny A. (University of Trento) | Riccardi, Giuseppe (University of Trento)
For the natural and social interaction it is necessary to understand human behavior. Personality is one of the fundamental aspects, by which we can understand behavioral dispositions. It is evident that there is a strong correlation between usersโ personality and the way they behave on online social network (e.g., Facebook). This paper presents automatic recognition of Big-5 personality traits on social network (Facebook) using usersโ status text. For the automatic recognition we studied different classification methods such as SMO (Sequential Minimal Optimization for Support Vector Machine), Bayesian Logistic Regression (BLR) and Multinomial Naรฏve Bayes (MNB) sparse modeling. Performance of the systems had been measured using macro-averaged precision, recall and F1; weighted average accuracy (WA) and un-weighted average accuracy (UA). Our comparative study shows that MNB performs better than BLR and SMO for personality traits recognition on the social network data.
Stochastic Optimization of PCA with Capped MSG
Arora, Raman, Cotter, Andrew, Srebro, Nathan
Principal Component Analysis (PCA) is a ubiquitous tool used in many data analysis, machine learning and information retrieval applications. It is used for obtaining a lower dimensional representation of a high dimensional signal that still captures as much as possible of the original signal. Such a low dimensional representation can be useful for reducing storage and computational costs, as complexity control in learning systems, or to aid in visualization.
Greedy Feature Selection for Subspace Clustering
Dyer, Eva L., Sankaranarayanan, Aswin C., Baraniuk, Richard G.
Unions of subspaces provide a powerful generalization to linear subspace models for collections of high-dimensional data. To learn a union of subspaces from a collection of data, sets of signals in the collection that belong to the same subspace must be identified in order to obtain accurate estimates of the subspace structures present in the data. Recently, sparse recovery methods have been shown to provide a provable and robust strategy for exact feature selection (EFS)--recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with L1-minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and the gap between the two approaches is particularly pronounced when the sampling of subspaces in the dataset is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble.
AdaBoost and Forward Stagewise Regression are First-Order Convex Optimization Methods
Freund, Robert M., Grigas, Paul, Mazumder, Rahul
Boosting methods are highly popular and effective supervised learning methods which combine weak learners into a single accurate model with good statistical performance. In this paper, we analyze two well-known boosting methods, AdaBoost and Incremental Forward Stagewise Regression (FS$_\varepsilon$), by establishing their precise connections to the Mirror Descent algorithm, which is a first-order method in convex optimization. As a consequence of these connections we obtain novel computational guarantees for these boosting methods. In particular, we characterize convergence bounds of AdaBoost, related to both the margin and log-exponential loss function, for any step-size sequence. Furthermore, this paper presents, for the first time, precise computational complexity results for FS$_\varepsilon$.
An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework
In order to cluster or partition data, we often use Expectation-and-Maximization (EM) or Variational approximation with a Gaussian Mixture Model (GMM), which is a parametric probability density function represented as a weighted sum of $\hat{K}$ Gaussian component densities. However, model selection to find underlying $\hat{K}$ is one of the key concerns in GMM clustering, since we can obtain the desired clusters only when $\hat{K}$ is known. In this paper, we propose a new model selection algorithm to explore $\hat{K}$ in a Bayesian framework. The proposed algorithm builds the density of the model order which any information criterions such as AIC and BIC basically fail to reconstruct. In addition, this algorithm reconstructs the density quickly as compared to the time-consuming Monte Carlo simulation.
Learning Mixed Graphical Models
Lee, Jason D., Hastie, Trevor J.
We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. In previous work, authors have considered structure learning of Gaussian graphical models and structure learning of discrete models. Our approach is a natural generalization of these two lines of work to the mixed case. The penalization scheme involves a novel symmetric use of the group-lasso norm and follows naturally from a particular parametrization of the model.
Semi-supervised Ranking Pursuit
Tsivtsivadze, Evgeni, Heskes, Tom
We propose a novel sparse preference learning/ranking algorithm. Our algorithm approximates the true utility function by a weighted sum of basis functions using the squared loss on pairs of data points, and is a generalization of the kernel matching pursuit method. It can operate both in a supervised and a semi-supervised setting and allows efficient search for multiple, near-optimal solutions. Furthermore, we describe the extension of the algorithm suitable for combined ranking and regression tasks. In our experiments we demonstrate that the proposed algorithm outperforms several state-of-the-art learning methods when taking into account unlabeled data and performs comparably in a supervised learning scenario, while providing sparser solutions.