Statistical Learning
Reproducing kernel Hilbert space based estimation of systems of ordinary differential equations
González, Javier, Vujačić, Ivan, Wit, Ernst
Nonlinear systems of differential equations have attracted the interest in fields like system biology, ecology or biochemistry, due to their flexibility and their ability to describe dynamical systems. Despite the importance of such models in many branches of science they have not been the focus of systematic statistical analysis until recently. In this work we propose a general approach to estimate the parameters of systems of differential equations measured with noise. Our methodology is based on the maximization of the penalized likelihood where the system of differential equations is used as a penalty. To do so, we use a Reproducing Kernel Hilbert Space approach that allows us to formulate the estimation problem as an unconstrained numeric maximization problem easy to solve. The proposed method is tested with synthetically simulated data and it is used to estimate the unobserved transcription factor CdaR in Steptomyes coelicolor using gene expression data of the genes it regulates. Keywords: System of ordinary differential equations, differential operator, reproducing kernel Hilbert space, gene regulatory network 1. Introduction Despite the fact that differential equations are a common modelling tool within science and engineering, statistical methods for estimating such models have only received widespread attention during the last few years. The difficulty of solving differential equations in general has been a major stumbling block for efficient statistical procedures.
Highly comparative feature-based time-series classification
Fulcher, Ben D., Jones, Nick S.
A highly comparative, feature-based approach to time series classification is introduced that uses an extensive database of algorithms to extract thousands of interpretable features from time series. These features are derived from across the scientific time-series analysis literature, and include summaries of time series in terms of their correlation structure, distribution, entropy, stationarity, scaling properties, and fits to a range of time-series models. After computing thousands of features for each time series in a training set, those that are most informative of the class structure are selected using greedy forward feature selection with a linear classifier. The resulting feature-based classifiers automatically learn the differences between classes using a reduced number of time-series properties, and circumvent the need to calculate distances between time series. Representing time series in this way results in orders of magnitude of dimensionality reduction, allowing the method to perform well on very large datasets containing long time series or time series of different lengths. For many of the datasets studied, classification performance exceeded that of conventional instance-based classifiers, including one nearest neighbor classifiers using Euclidean distances and dynamic time warping and, most importantly, the features selected provide an understanding of the properties of the dataset, insight that can guide further scientific investigation.
A Novel Methodology for Processing Probabilistic Knowledge Bases Under Maximum Entropy
Kern-Isberner, Gabriele (TU Dortmund University) | Wilhelm, Marco (TU Dortmund University) | Beierle, Christoph (University of Hagen)
Probabilistic reasoning under the so-called principle of maximum entropy is a viable and convenient alternative to Bayesian networks, relieving the user from providing complete (local) probabilistic information and observing rigorous conditional independence assumptions. In this paper, we present a novel approach to performing computational MaxEnt reasoning that makes use of symbolic computations instead of graph-based techniques. Given a probabilistic knowledge base, we encode the MaxEnt optimization problem into a system of polynomial equations, and then apply Gröbner basis theory to find MaxEnt inferences as solutions to the polynomials. We illustrate our approach with an example of a knowledge base that represents findings on fraud detection in enterprises.
Implementation of a Transformation System for Relational Probabilistic Knowledge Bases Simplifying the Maximum Entropy Model Computation
Beierle, Christoph (University of Hagen) | Höhnerbach, Markus (University of Hagen) | Marto, Marcus (University of Hagen)
The maximum entropy (ME) model of a knowledge base R consisting of relational probabilistic conditionals can be defined referring to the set of all ground instances of the conditionals. The logic FO-PCL employs the notion of parametric uniformity for avoiding the full grounding of R. We present an implementation of a rule system transforming R into a knowledge base that is parametrically uniform and has the same ME model, simplifying the ME model computation. The implementation provides different execution and evaluation modes, including the generation of all possible solutions.
Mitigating the Curse of Dimensionality for Exact kNN Retrieval
Schuh, Michael A. (Montana State University) | Wylie, Tim (University of Alberta) | Angryk, Rafal A. (Georgia State University)
Efficient data indexing and exact k-nearest-neighbor (kNN) retrieval are still challenging tasks in high-dimensional spaces. This work highlights the difficulties of indexing in high-dimensional and tightly-clustered dataspaces by exploring several important tunable parameters for optimizing kNN query performance using the iDistance and iDStar algorithms. We experiment on real and synthetic datasets of varying size, cluster density, and dimensionality, and compare performance primarily through filter-and-refine efficiency and execution time. Results show great variability over parameter values and provide new insights and justifications in support of prior best-use practices. Local segmentation with iDStar consistently outperforms iDistance in any clustered space below 256 dimensions, setting a new benchmark for efficient and exact kNN retrieval in high-dimensional spaces. We propose several directions of future work to further increase performance in high-dimensional real-world settings.
Extreme Logistic Regression: A Large Scale Learning Algorithm with Application to Prostate Cancer Mortality Prediction
Ngufor, Che (George Mason University) | Wojtusiak, Janusz (George Mason University) | Hooker, Andrea (George Mason University) | Oz, Talha (George Mason University) | Hadley, Jack (George Mason University)
With the recent popularity of electronic medical records, enormous amount of medical data is being generated every day at an exponential rate.Machine learning methods have been shown in many studies to be capable of producing automatic medical diagnostic models such as automated prognostic models. However, many powerful machine learning algorithms such as support vector machine (SVM), Random Forest (RF) or Kernel Logistic Regression (KLR) are unbearably slow for very large datasets. This makes their use in medical research limited to small to medium scale problems.This study is motivated by an ongoing research on prostate cancer mortality prediction for a national representative of US population where the SVM and RF took several hours or days to trainwhereas simple linear methods such as logistic regression or linear discriminant analysis take minutes or even seconds.Because, most real-world problems are non-linear, this paper presents a large scale algorithm enabling a recently proposed least squares extreme logistic regression to learn very large datasets. The algorithm is shown on a case study of mortality prediction for men diagnosed with early stage prostate cancer to provide very fast and more accurate result than standard statistical methods.
Toward Building Automatic Affect Recognition Machine Using Acoustics Features
Marpaung, Andreas H. (University of Central Florida) | Gonzalez, Avelino (University of Central Florida)
Research in the field of Affective Computing on affect recognition through speech has used a “fishing expedition” approach. Although some frameworks could achieve certain success rates, many of these approaches missed the theory behind the underlying voice and speech production mechanism. In this work, we found some correlation among the acoustic parameters (paralinguistic/non-verbal speech content) in the physiological mechanism of voice production. Furthermore, we also found some correlation when analyzing their relationships statistically. Aligned with this finding, we implemented our framework using the K-Nearest Neighbors (KNN) algorithm. Although our work is still in its infancy, we believe this context-free approach will bring us forward toward creating an intelligent agent with affect recognition ability. This paper describes the problem, our approach and our results.
Strategy Mining
Xu, Xiaoxi (University of Massachusetts, Amherst) | Jensen, David (University of Massachusetts, Amherst) | Rissland, Edwina L. (University of Massachusetts, Amherst)
Strategy mining is a new area of research about discovering strategies for decision-making. It is motivated by how similarity is assessed in retrospect in law. In the legal domain, when both case facts and court decisions are present, it is often useful to assess similarity by accounting for both case facts and case outcomes. In this paper, we formulate the strategy mining problem as a clustering problem with the goal of finding clusters that represent disparate conditional dependency of decision labels on other features. Existing clustering algorithms are inappropriate to cluster dependency because they either assume feature independence, such as K-means, or only consider the co-occurrence of features without explicitly modeling the special dependency of the decision label on other features, such as Latent Dirichlet Allocation (LDA). We propose an Expectation Maximization (EM) style unsupervised learning algorithm for dependency clustering. Like EM, our algorithm is grounded in statistical learning theory. It minimizes the empirical risk of decision tree learning. Unlike other clustering algorithms, our algorithm is irrelevant-feature resistant, and its learned clusters modeled by decision trees are strongly interpretable and predictive. We systematically evaluate both the convergence property and solution quality of our algorithm using a common law dataset comprised of actual cases. Experimental results show that our algorithm significantly outperforms K-means and LDA on clustering dependency
Histogram-Based Method for Effective Initialization of the K-Means Clustering Algorithm
Gingles, Caroline (Louisiana State University in Shreveport) | Celebi, M. Emre (Louisiana State University in Shreveport)
K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, this algorithm is highly sensitive to the initial selection of the cluster centers. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have superlinear complexity in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the order in which the data points are processed. These methods are generally unreliable in that the quality of their results is unpredictable. In this paper, we propose a linear, deterministic, and order-invariant initialization method based on multidimensional histograms. Experiments on a diverse collection of data sets from the UCI Machine Learning Repository demonstrate the superiority of our method over the well-known maximin method.
SMART Electronic Legal Discovery Via Topic Modeling
George, Clint Pazhayidam (University of Florida) | Puri, Sahil (University of Florida) | Wang, Daisy Zhe (University of Florida) | Wilson, Joseph N. (University of Florida) | Hamilton, William F. (University of Florida)
Electronic discovery is an interesting subproblem of information retrieval in which one identifies documents that are potentially relevant to issues and facts of a legal case from an electronically stored document collection (a corpus). In this paper, we consider representing documents in a topic space using the well-known topic models such as latent Dirichlet allocation and latent semantic indexing, and solving the information retrieval problem via finding document similarities in the topic space rather doing it in the corpus vocabulary space. We also develop an iterative SMART ranking and categorization framework including human-in-the-loop to label a set of seed (training) documents and using them to build a semi-supervised binary document classification model based on Support Vector Machines. To improve this model, we propose a method for choosing seed documents from the whole population via an active learning strategy. We report the results of our experiments on a real dataset in the electronic discovery domain.