Goto

Collaborating Authors

 Statistical Learning


A Simulated Annealing Clustering Algorithm Based On Center Perturbation Using Gaussian Mutation

AAAI Conferences

Clustering, the unsupervised classification of objects into groups, is a widely used technique in exploratory data analysis. The clustering problem is a very complex one, and a popular heuristic for solving it is the Simulated Annealing (SA) algorithm. SA is an approximation algorithm that involves generating a neighborhood solution by perturbing the current solution in a small, yet meaningful way. This new solution is accepted with a probability of 1 if it is quantitatively better than the current solution, and accepted according to the Metropolis criterion otherwise. Cluster quality is measured using the Sum of Squared Error (SSE) criterion. This paper presents an SA algorithm that uses a new type of perturbation to generate solutions. Whereas most SA clustering algorithms perturb data point memberships directly, our algorithm perturbs a randomly chosen center using Gaussian mutation, and then reassigns data points in a nearest neighbor fashion. Experimental results on a diverse collection of data sets demonstrate that our algorithm has comparable effectiveness to other SA algorithms, while being much faster due to its simplicity.


MDL-Based Unsupervised Attribute Ranking

AAAI Conferences

In the present paper we propose an unsupervised attribute ranking method based on evaluating the quality of clustering that each attribute produces by partitioning the data into subsets according to its values. We use the Minimum Description Length (MDL) principle to evaluate the quality of clustering and describe an algorithm for attribute ranking and a related clustering algorithm. Both algorithms are empirically evaluated on benchmark data sets. The experiments show that the MDL-based ranking performs closely to the supervised information gain ranking and thus improves the performance of the EM and k-means clustering algorithms in purely unsupervised setting.


An Accelerated Nearest Neighbor Search Method for the K-Means Clustering Algorithm

AAAI Conferences

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, the nearest neighbor search step of this algorithm can be computationally expensive, as the distance between each input vector and all cluster centers need to be calculated. To accelerate this step, a computationally inexpensive distance estimation method can be tried first, resulting in the rejection of candidate centers that cannot possibly be the nearest center to the input vector under consideration. This way, the computational requirements of the search can be reduced as most of the full distance computations become unnecessary. In this paper, a fast nearest neighbor search method that rejects impossible centers to accelerate the k-means clustering algorithm is presented. Our method uses geometrical relations among the input vectors and the cluster centers to reject many unlikely centers that are not typically rejected by similar approaches. Experimental results show that the method can reduce the number of distance computations significantly without degrading the clustering accuracy.


An Ensemble Approach to Instance-Based Regression Using Stretched Neighborhoods

AAAI Conferences

Instance-based regression methods generate solutions from prior solutions within a neighborhood of the input query. Their performance depends on both neighborhood selection criteria and on the method for generating new solutions from the values of prior instances. This paper proposes a new approach to addressing both problems, in which solutions are generated by an ensemble of solutions of local linear regression models built for a collection of "stretched" neighborhoods of the query. Each neighborhood is generated by relaxing a different dimension of the problem space. The rationale is to enable major change trends along that dimension to have increased influence on the corresponding model. The approach is evaluated for two candidate relaxation approaches, gradient-based and based on fixed profiles, and compared to baselines of k-NN and using a radius-based spherical neighborhood in n-dimensional space. Results in four test domains show up to 15 percent improvement over baselines, and suggest that the approach could be particularly useful in domains for which the space of prior instances is sparse.


Feature Ranking and Support Vector Machines Classification Analysis of the NSL-KDD Intrusion Detection Corpus

AAAI Conferences

Currently, signature based Intrusion Detection Systems (IDS) approaches are inadequate to address threats posed to networked systems by zero-day exploits. Statistical machine learning techniques offer a great opportunity to mitigate these threats. However, at this point, statistical based IDS systems are not mature enough to be implemented in realtime systems and the techniques to be used are not sufficiently understood. This study focuses on a recently expanded corpus for IDS analysis. Feature analysis and Support Vector Machines classification are performed to obtain a better understanding of the corpus and to establish a baseline set of results which can be used by other studies for comparison. Results of the classification and feature analysis are discussed.


Automatic Detection of Nominal Entities in Speech for Enriched Content Search

AAAI Conferences

In this work, a methodology is developed to detect sentient actors in spoken stories. Meta-tags are then saved to XML files associated with the audio files. A recursive approach is used to find actor candidates and features which are then classified using machine learning approaches. Results of the study indicate that the methodology performed well on a narrative based corpus of childrenโ€™s stories. Using Support Vector Machines for classification, an F-measure accuracy score of 86% was achieved for both named and unnamed entities. Additionally, feature analysis indicated that speech features were very useful when detecting unnamed actors.


Cluster Analysis for Optimal Indexing

AAAI Conferences

High-dimensional indexing is an important area of current research, especially for range and kNN queries. This work introduces clustering for the sake of indexing. The goal is to develop new clustering methods designed to optimize the data partitioning for an indexing-specific tree structure instead of finding data distribution-based clusters. We focus on iDistance, a state-of-the-art high-dimensional indexing method, and take a basic approach to solving this new problem. By utilizing spherical clusters in an unsupervised Expectation Maximization algorithm dependent upon local density and cluster overlap, we create a partitioning of the space providing balanced segmentation for a B+-tree. We also look at the novel idea of reclustering for a specific indexing method by taking the output of one clustering method and reclustering it for use in an index. The algorithms are then tested and evaluated based on our error metric and iDistance query performance.


Bias and Variance Optimization for SVMs Model Selection

AAAI Conferences

Support vector machines (SVMs) are among the most used methods for pattern recognition. Acceptable results have been obtained with such methods in many domains and applications. However, as most learning algorithms, SVMs have hyperparameters that influence the effectiveness of the generated model. Thus, choosing adequate values for such hyperparameters is critical in order to obtain satisfactory results for a given classification task, a problem known as model selection. This paper introduces a novel model selection approach for SVMs based on multi-objective optimization and on the bias and variance definition. We propose an evolutionary algorithm that aims to select the configuration of hyperparameters that optimizes a trade-off between estimates of bias and variance; two factors that are closely related to the model accuracy and complexity. The proposed technique is evaluated using a suite of benchmark data sets for classification. Experimental results show the validity of our approach. We found that the model selection criteria resulted very helpful for selecting highly effective classification models.


Comparing Frequency- and Style-Based Features for Twitter Author Identification

AAAI Conferences

Author identification is a subfield of Natural Language Processing (NLP) that uses machine learning techniques to identify the author of a text. Most previous research focused on long texts with the assumption that a minimum text length threshold exists under which author identification would no longer be effective. This paper examines author identification in short texts far below this threshold, focusing on messages retrieved from Twitter (maximum length: 140 characters) to determine the most effective feature set for author identification. Both Bag-of-Words (BOW) and Style Marker feature sets were extracted and evaluated through a series of 15 experiments involving up to 12 authors with large and small dataset sizes. Support Vector Machines (SVM) were used for all experiments. Our results achieve classification accuracies approaching that of longer texts, even for small dataset sizes of 60 training instances per author. Style Marker feature sets were found to be significantly more useful than BOW feature sets as well as orders of magnitude faster, and are therefore suggested for potential applications in future research.


Quantum Annealing for Dirichlet Process Mixture Models with Applications to Network Clustering

arXiv.org Machine Learning

We developed a new quantum annealing (QA) algorithm for Dirichlet process mixture (DPM) models based on the Chinese restaurant process (CRP). QA is a parallelized extension of simulated annealing (SA), i.e., it is a parallel stochastic optimization technique. Existing approaches [Kurihara et al. UAI2009, Sato et al. UAI2009] and cannot be applied to the CRP because their QA framework is formulated using a fixed number of mixture components. The proposed QA algorithm can handle an unfixed number of classes in mixture models. We applied QA to a DPM model for clustering vertices in a network where a CRP seating arrangement indicates a network partition. A multi core processor was used for running QA in experiments, the results of which show that QA is better than SA, Markov chain Monte Carlo inference, and beam search at finding a maximum a posteriori estimation of a seating arrangement in the CRP. Since our QA algorithm is as easy as to implement the SA algorithm, it is suitable for a wide range of applications.