Statistical Learning
An improved quasar detection method in EROS-2 and MACHO LMC datasets
Pichara, Karim, Protopapas, Pavlos, Kim, Dae-Won, Marquette, Jean-Baptiste, Tisserand, Patrick
We present a new classification method for quasar identification in the EROS-2 and MACHO datasets based on a boosted version of Random Forest classifier. We use a set of variability features including parameters of a continuous auto regressive model. We prove that continuous auto regressive parameters are very important discriminators in the classification process. We create two training sets (one for EROS-2 and one for MACHO datasets) using known quasars found in the LMC. Our model's accuracy in both EROS-2 and MACHO training sets is about 90% precision and 86% recall, improving the state of the art models accuracy in quasar detection. We apply the model on the complete, including 28 million objects, EROS-2 and MACHO LMC datasets, finding 1160 and 2551 candidates respectively. To further validate our list of candidates, we crossmatched our list with a previous 663 known strong candidates, getting 74% of matches for MACHO and 40% in EROS-2. The main difference on matching level is because EROS-2 is a slightly shallower survey which translates to significantly lower signal-to-noise ratio lightcurves.
On Power-law Kernels, corresponding Reproducing Kernel Hilbert Space and Applications
Ghoshdastidar, Debarghya, Dukkipati, Ambedkar
Abstract--The role of kernels is central to machine learning. Motivated by the importance of power-law distributions in statistical modeling, in this paper, we propose the notion of power-law kernels to investigate power-laws in learning problem. We propose two power-law kernels by generalizing Gaussian and Laplacian kernels. This generalization is based on distributions, arising out of maximization of a generalized information measure known as nonextensive entropy that is very well studied in statistical mechanics. We prove that the proposed kernels are positive definite, and provide some insights regarding the corresponding Reproducing Kernel Hilbert Space (RKHS). We also study practical significance of both kernels in classification and regression, and present some simulation results.
Sparse Projections of Medical Images onto Manifolds
Chen, George H., Wachinger, Christian, Golland, Polina
Manifold learning has been successfully applied to a variety of medical imaging problems. Its use in real-time applications requires fast projection onto the low-dimensional space. To this end, out-of-sample extensions are applied by constructing an interpolation function that maps from the input space to the low-dimensional manifold. Commonly used approaches such as the Nystr\"{o}m extension and kernel ridge regression require using all training points. We propose an interpolation function that only depends on a small subset of the input training data. Consequently, in the testing phase each new point only needs to be compared against a small number of input training data in order to project the point onto the low-dimensional space. We interpret our method as an out-of-sample extension that approximates kernel ridge regression. Our method involves solving a simple convex optimization problem and has the attractive property of guaranteeing an upper bound on the approximation error, which is crucial for medical applications. Tuning this error bound controls the sparsity of the resulting interpolation function. We illustrate our method in two clinical applications that require fast mapping of input images onto a low-dimensional space.
Detecting Overlapping Temporal Community Structure in Time-Evolving Networks
Chen, Yudong, Kawadia, Vikas, Urgaonkar, Rahul
We present a principled approach for detecting overlapping temporal community structure in dynamic networks. Our method is based on the following framework: find the overlapping temporal community structure that maximizes a quality function associated with each snapshot of the network subject to a temporal smoothness constraint. A novel quality function and a smoothness constraint are proposed to handle overlaps, and a new convex relaxation is used to solve the resulting combinatorial optimization problem. We provide theoretical guarantees as well as experimental results that reveal community structure in real and synthetic networks. Our main insight is that certain structures can be identified only when temporal correlation is considered and when communities are allowed to overlap. In general, discovering such overlapping temporal community structure can enhance our understanding of real-world complex networks by revealing the underlying stability behind their seemingly chaotic evolution.
Scalable Text and Link Analysis with Mixed-Topic Link Models
Zhu, Yaojia, Yan, Xiaoran, Getoor, Lise, Moore, Cristopher
Many data sets contain rich information about objects, as well as pairwise relations between them. For instance, in networks of websites, scientific papers, and other documents, each node has content consisting of a collection of words, as well as hyperlinks or citations to other nodes. In order to perform inference on such data sets, and make predictions and recommendations, it is useful to have models that are able to capture the processes which generate the text at each node and the links between them. In this paper, we combine classic ideas in topic modeling with a variant of the mixed-membership block model recently developed in the statistical physics community. The resulting model has the advantage that its parameters, including the mixture of topics of each document and the resulting overlapping communities, can be inferred with a simple and scalable expectation-maximization algorithm. We test our model on three data sets, performing unsupervised topic classification and link prediction. For both tasks, our model outperforms several existing state-of-the-art methods, achieving higher accuracy with significantly less computation, analyzing a data set with 1.3 million words and 44 thousand links in a few minutes.
Efficiently Using Second Order Information in Large l1 Regularization Problems
Tang, Xiaocheng, Scheinberg, Katya
We propose a novel general algorithm LHAC that efficiently uses second-order information to train a class of large-scale l1-regularized problems. Our method executes cheap iterations while achieving fast local convergence rate by exploiting the special structure of a low-rank matrix, constructed via quasi-Newton approximation of the Hessian of the smooth loss function. A greedy active-set strategy, based on the largest violations in the dual constraints, is employed to maintain a working set that iteratively estimates the complement of the optimal active set. This allows for smaller size of subproblems and eventually identifies the optimal active set. Empirical comparisons confirm that LHAC is highly competitive with several recently proposed state-of-the-art specialized solvers for sparse logistic regression and sparse inverse covariance matrix selection.
Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients
Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing learning rates over time on stationary problems, and permitting learning rates to grow appropriately in non-stationary tasks. Here, we extend the idea in three directions, addressing proper minibatch parallelization, including reweighted updates for sparse or orthogonal gradients, improving robustness on non-smooth loss functions, in the process replacing the diagonal Hessian estimation procedure that may not always be available by a robust finite-difference approximation. The final algorithm integrates all these components, has linear complexity and is hyper-parameter free.
Comparing Expert Systems Built Using Different Uncertain Inference Systems
Vaughan, David S., Perrin, Bruce M., Yadrick, Robert M.
This study compares the inherent intuitiveness or usability of the most prominent methods for managing uncertainty in expert systems, including those of EMYCIN, PROSPECTOR, Dempster-Shafer theory, fuzzy set theory, simplified probability theory (assuming marginal independence), and linear regression using probability estimates. Participants in the study gained experience in a simple, hypothetical problem domain through a series of learning trials. They were then randomly assigned to develop an expert system using one of the six Uncertain Inference Systems (UISs) listed above. Performance of the resulting systems was then compared. The results indicate that the systems based on the PROSPECTOR and EMYCIN models were significantly less accurate for certain types of problems compared to systems based on the other UISs. Possible reasons for these differences are discussed.
Experiments Using Belief Functions and Weights of Evidence incorporating Statistical Data and Expert Opinions
McLeish, Mary, Yao, P., Cecile, M., Stirtzinger, T.
This paper presents some ideas and results of using uncertainty management methods in the presence of data in preference to other statistical and machine learning methods. A medical domain is used as a test-bed with data available from a large hospital database system which collects symptom and outcome information about patients. Data is often missing, of many variable types and sample sizes for particular outcomes is not large. Uncertainty management methods are useful for such domains and have the added advantage of allowing for expert modification of belief values originally obtained from data. Methodological considerations for using belief functions on statistical data are dealt with in some detail. Expert opinions are Incorporated at various levels of the project development and results are reported on an application to liver disease diagnosis. Recent results contrasting the use of weights of evidence and logistic regression on another medical domain are also presented.
Induction, of and by Probability
This paper examines some methods and ideas underlying the author's successful probabilistic learning systems(PLS), which have proven uniquely effective and efficient in generalization learning or induction. While the emerging principles are generally applicable, this paper illustrates them in heuristic search, which demands noise management and incremental learning. In our approach, both task performance and learning are guided by probability. Probabilities are incrementally normalized and revised, and their errors are located and corrected.