Statistical Learning
Decoding Beta-Decay Systematics: A Global Statistical Model for Beta^- Halflives
Costiris, N. J., Mavrommatis, E., Gernoth, K. A., Clark, J. W.
Rev. C) Statistical modeling of nuclear data provides a novel approach to nuclear systematics complementary to established theoretical and phenomenological approaches based on quantum theory. More specifically, fully-connected, multilayer feedforward artificial neural network models are developed using the Levenberg-Marquardt optimization algorithm together with Bayesian regularization and cross-validation. The predictive performance of models emerging from extensive computer experiments is compared with that of traditional microscopic and phenomenological models as well as with the performance of other learning systems, including earlier neural network models as well as the support vector machines recently applied to the same problem. In discussing the results, emphasis is placed on predictions for nuclei that are far from the stability line, and especially those involved in the r-process nucleosynthesis. It is found that the new statistical models can match or even surpass the predictive performance of conventional models for beta-decay systematics and accordingly should provide a valuable additional tool for exploring the expanding nuclear landscape. I. INTRODUCTION "Numbers are the within of all things." Among nuclear physicists this need is driven both by the experimental programs of existing and future radioactive ion beam facilities and by the stresses placed on established nuclear structure theory as totally new areas of the nuclear landscape are opened for exploration. For nuclear astrophysicists, such information is intrinsic to an understanding of supernova explosions - the initialization of the explosion, the subsequent neutronization of the core material, and the strength and fate of the shock wave formed - and the nucleosynthesis of heavy elements above Fe, notably the r-process [3, 4, 5]. Both the element distribution on the r-path and the time scale of the r-process are highly sensitive to the β-decay properties of the neutron-rich nuclei involved.
Except for a few key nuclei, β decay of r-process nuclei cannot be studied in terrestrial laboratories, so the required information must come from nuclear models. These include the more phenomenological treatments, such as the Gross Theory (GT), as well as microscopic approaches based on the shell model and the proton-neutron Quasiparticle Random-Phase Approximation (pnQRPA) in various versions.
Local Procrustes for Manifold Embedding: A Measure of Embedding Quality and Embedding Algorithms
Machine Learning manuscript No. (will be inserted by the editor) Abstract We present the Procrustes measure, a novel measure based on Procrustes rotation that enables quantitative comparison of the output of manifold-based embedding algorithms (such as LLE (Roweis and Saul, 2000) and Isomap (Tenenbaum et al, 2000)). The measure also serves as a natural tool when choosing dimension-reduction parameters. We also present two novel dimension-reduction techniques that attempt to minimize the suggested measure, and compare the results of these techniques to the results of existing algorithms. Finally, we suggest a simple iterative method that can be used to improve the output of existing algorithms. Keywords Dimension reducing · Manifold learning · Procrustes analysis · Local PCA · Simulated annealing 1 Introduction Technological advances constantly improve our ability to collect and store large sets of data. The main difficulty in analyzing such high-dimensional data sets is that the number of observations required to estimate functions at a set level of accuracy grows exponentially with the dimension. This problem, often referred to as the curse of dimensionality, has led to various techniques that attempt to reduce the dimension of the original data. Historically, the main approach to dimension reduction is the linear one. This is the approach used by principal component analysis (PCA) and factor analysis (see Mardia et al, 1979, for both).
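To illustrate the kind of quantity this abstract describes, here is a minimal sketch of a local Procrustes score. It is not the authors' implementation. It assumes the score averages, over k-nearest-neighbor patches, the residual left after the best rotation-plus-translation alignment of each embedded patch onto its input-space counterpart, divided by the centered squared norm of the input patch. The function names, the default k=5, and this normalization are all assumptions made for illustration.

```python
import numpy as np

def procrustes_statistic(X, Z):
    """Residual of the best rigid alignment of Z (k x d) onto X (k x D), d <= D."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    Xc = X - X.mean(axis=0)  # centering absorbs the optimal translation
    Zc = Z - Z.mean(axis=0)
    # The singular values of Zc^T Xc give the largest overlap any rotation can reach.
    s = np.linalg.svd(Zc.T @ Xc, compute_uv=False)
    return (Xc ** 2).sum() + (Zc ** 2).sum() - 2.0 * s.sum()

def local_procrustes_measure(X, Z, k=5):
    """Average normalized Procrustes residual over k-nearest-neighbor patches."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Pairwise squared distances in the input space define the neighborhoods.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    total = 0.0
    for i in range(len(X)):
        idx = np.argsort(d2[i])[: k + 1]  # point i together with its k nearest neighbors
        Xi, Zi = X[idx], Z[idx]
        norm = ((Xi - Xi.mean(axis=0)) ** 2).sum()
        total += procrustes_statistic(Xi, Zi) / norm
    return total / len(X)
```

Under these assumptions, an embedding that preserves every neighborhood up to rotation and translation scores close to zero, and local stretching or folding raises the score. That makes the value usable for comparing the outputs of two algorithms or two parameter settings on the same data.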
Belief Propagation and Beyond for Particle Tracking
Chertkov, Michael, Kroc, Lukas, Vergassola, Massimo
We describe a novel approach to statistical learning from particles tracked while moving in a random environment. The problem consists in inferring properties of the environment from recorded snapshots. We consider here the case of a fluid seeded with identical passive particles that diffuse and are advected by a flow. Our approach rests on efficient algorithms to estimate the weighted number of possible matchings among particles in two consecutive snapshots, the partition function of the underlying graphical model. The partition function is then maximized over the model parameters, namely diffusivity and velocity gradient. A Belief Propagation (BP) scheme is the backbone of our algorithm, providing accurate results for the flow parameters we want to learn. The BP estimate is additionally improved by incorporating Loop Series (LS) contributions. For the weighted matching problem, LS is compactly expressed as a Cauchy integral, accurately estimated by a saddle point approximation. Numerical experiments show that the quality of our improved BP algorithm is comparable to that of a fully polynomial randomized approximation scheme, based on the Markov Chain Monte Carlo (MCMC) method, while the BP-based scheme is substantially faster than the MCMC scheme.
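The objective named in this abstract, the weighted number of matchings between two snapshots, can be made concrete for a handful of particles. The sketch below is not the BP or Loop Series algorithm: it enumerates every matching by brute force, which is only feasible for very small particle counts. It also assumes pure diffusion with no advection, so each pair weight is a Gaussian transition density and diffusivity is the only parameter. The function names and the grid search over candidate values are illustrative assumptions.

```python
import itertools
import math

def transition_weight(x, y, kappa, dt):
    """Gaussian density for diffusing from x to y in time dt (points are tuples)."""
    d = len(x)
    r2 = sum((b - a) ** 2 for a, b in zip(x, y))
    return math.exp(-r2 / (4.0 * kappa * dt)) / (4.0 * math.pi * kappa * dt) ** (d / 2.0)

def partition_function(xs, ys, kappa, dt=1.0):
    """Sum over all one-to-one matchings of the product of pair weights (a permanent)."""
    n = len(xs)
    w = [[transition_weight(x, y, kappa, dt) for y in ys] for x in xs]
    return sum(
        math.prod(w[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )

def best_diffusivity(xs, ys, candidates, dt=1.0):
    """Pick the candidate diffusivity that maximizes the partition function."""
    return max(candidates, key=lambda kappa: partition_function(xs, ys, kappa, dt))
```

The enumeration costs n! terms, which is why an approximate scheme such as BP is needed at realistic particle counts. The brute-force value is still useful as a reference when checking an approximation on toy inputs.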
The Structure of Narrative: the Case of Film Scripts
Murtagh, Fionn, Ganz, Adam, McKie, Stewart
We analyze the style and structure of story narrative using the case of film scripts. The practical importance of this is noted, especially the need to have support tools for television movie writing. We use the Casablanca film script, and scripts from six episodes of CSI (Crime Scene Investigation). For analysis of style and structure, we quantify various central perspectives discussed in McKee's book, "Story: Substance, Structure, Style, and the Principles of Screenwriting". Film scripts offer a useful point of departure for exploration of the analysis of more general narratives. Our methodology, using Correspondence Analysis, and hierarchical clustering, is innovative in a range of areas that we discuss. In particular this work is groundbreaking in taking the qualitative analysis of McKee and grounding this analysis in a quantitative and algorithmic framework.
A Kernel Method for the Two-Sample Problem
Gretton, Arthur, Borgwardt, Karsten, Rasch, Malte J., Scholkopf, Bernhard, Smola, Alexander J.
We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g. a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
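The quadratic-time statistic mentioned in this abstract can be sketched in a few lines. The code below computes a biased empirical estimate of the squared statistic, assuming a Gaussian kernel with a hand-picked bandwidth sigma. The decision rule is a plain permutation test, which is a substitute chosen for brevity and not one of the three tests the abstract describes (those rely on large deviation bounds and the asymptotic distribution). The function names and default parameters are illustrative assumptions.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    r2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-r2 / (2.0 * sigma ** 2))

def mmd2_biased(xs, ys, sigma=1.0):
    """Biased estimate of the squared kernel mean discrepancy, O((m + n)^2) kernel calls."""
    m, n = len(xs), len(ys)
    kxx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (m * m)
    kyy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / (n * n)
    kxy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy

def permutation_p_value(xs, ys, sigma=1.0, n_perm=200, seed=0):
    """Permutation p-value: how often a random relabeling matches or beats the observed value."""
    rng = random.Random(seed)
    observed = mmd2_biased(xs, ys, sigma)
    pooled = list(xs) + list(ys)
    m = len(xs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the pooled sample at random
        if mmd2_biased(pooled[:m], pooled[m:], sigma) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A small p-value suggests that the two samples come from different distributions. The result depends on the bandwidth, so sigma should be treated as a tuning choice and not as a fixed constant.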
Adaptive Affinity Propagation Clustering
Wang, Kaijun, Zhang, Junying, Li, Dan, Zhang, Xinna, Guo, Tao
Affinity propagation clustering (AP) has two limitations: it is hard to know what value of the parameter 'preference' yields an optimal clustering solution, and oscillations, if they occur, cannot be eliminated automatically. The adaptive AP method is proposed to overcome these limitations. It includes adaptive scanning of preferences to search the space of the number of clusters for the optimal clustering solution, adaptive adjustment of damping factors to eliminate oscillations, and adaptive escaping from oscillations when the damping adjustment technique fails. Experimental results on simulated and real data sets show that the adaptive AP is effective and can outperform AP in quality of clustering results.
Contact state analysis using NFIS and SOM
In this manner, the evolution of contact states on a simple system has been investigated by parallelization of DDA. A comparison between NFIS and SOM results is then presented. The results show the applicability of the proposed methods, with differing accuracy, to detecting the distribution of contacts.
Information Preserving Component Analysis: Data Projections for Flow Cytometry Analysis
Carter, Kevin M., Raich, Raviv, Finn, William G., Hero, Alfred O. III
Flow cytometry is often used to characterize the malignant cells in leukemia and lymphoma patients, traced to the level of the individual cell. Typically, flow cytometric data analysis is performed through a series of 2-dimensional projections onto the axes of the data set. Through the years, clinicians have determined combinations of different fluorescent markers which generate relatively known expression patterns for specific subtypes of leukemia and lymphoma -- cancers of the hematopoietic system. By only viewing a series of 2-dimensional projections, the high-dimensional nature of the data is rarely exploited. In this paper we present a means of determining a low-dimensional projection which maintains the high-dimensional relationships (i.e. information) between differing oncological data sets. By using machine learning techniques, we allow clinicians to visualize data in a low dimension defined by a linear combination of all of the available markers, rather than just 2 at a time. This provides an aid in diagnosing similar forms of cancer, as well as a means for variable selection in exploratory flow cytometric research. We refer to our method as Information Preserving Component Analysis (IPCA).
On the underestimation of model uncertainty by Bayesian K-nearest neighbors
Su, Wanhua, Chipman, Hugh, Zhu, Mu
When using the K-nearest neighbors method, one often ignores uncertainty in the choice of K. To account for such uncertainty, Holmes and Adams (2002) proposed a Bayesian framework for K-nearest neighbors (KNN). Their Bayesian KNN (BKNN) approach uses a pseudo-likelihood function, and standard Markov chain Monte Carlo (MCMC) techniques to draw posterior samples. Holmes and Adams (2002) focused on the performance of BKNN in terms of misclassification error but did not assess its ability to quantify uncertainty. We present some evidence to show that BKNN still significantly underestimates model uncertainty.