AITopics

center perturbation, gaussian mutation, simulated annealing clustering algorithm

Clustering, the unsupervised classification of objects into groups, is a widely used technique in exploratory data analysis. The clustering problem is a very complex one, and a popular heuristic for solving it is the Simulated Annealing (SA) algorithm. SA is an approximation algorithm that involves generating a neighborhood solution by perturbing the current solution in a small, yet meaningful way. This new solution is accepted with a probability of 1 if it is quantitatively better than the current solution, and accepted according to the Metropolis criterion otherwise. Cluster quality is measured using the Sum of Squared Error (SSE) criterion. This paper presents an SA algorithm that uses a new type of perturbation to generate solutions. Whereas most SA clustering algorithms perturb data point memberships directly, our algorithm perturbs a randomly chosen center using Gaussian mutation, and then reassigns data points in a nearest neighbor fashion. Experimental results on a diverse collection of data sets demonstrate that our algorithm has comparable effectiveness to other SA algorithms, while being much faster due to its simplicity.

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)

Markov, Zdravko (Central Connecticut State University)

MDL-Based Unsupervised Attribute Ranking

mdl-based unsupervised attribute ranking

In the present paper we propose an unsupervised attribute ranking method based on evaluating the quality of clustering that each attribute produces by partitioning the data into subsets according to its values. We use the Minimum Description Length (MDL) principle to evaluate the quality of clustering and describe an algorithm for attribute ranking and a related clustering algorithm. Both algorithms are empirically evaluated on benchmark data sets. The experiments show that the MDL-based ranking performs closely to the supervised information gain ranking and thus improves the performance of the EM and k-means clustering algorithms in purely unsupervised setting.

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory > Minimum Complexity Machines (0.53)

Fausett, Adam (Louisiana State University in Shreveport) | Celebi, M. Emre (Louisiana State University in Shreveport)

An Accelerated Nearest Neighbor Search Method for the K-Means Clustering Algorithm

accelerated nearest neighbor search method, k-means clustering algorithm

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, the nearest neighbor search step of this algorithm can be computationally expensive, as the distance between each input vector and all cluster centers need to be calculated. To accelerate this step, a computationally inexpensive distance estimation method can be tried first, resulting in the rejection of candidate centers that cannot possibly be the nearest center to the input vector under consideration. This way, the computational requirements of the search can be reduced as most of the full distance computations become unnecessary. In this paper, a fast nearest neighbor search method that rejects impossible centers to accelerate the k-means clustering algorithm is presented. Our method uses geometrical relations among the input vectors and the cluster centers to reject many unlikely centers that are not typically rejected by similar approaches. Experimental results show that the method can reduce the number of distance computations significantly without degrading the clustering accuracy.

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

An Ensemble Approach to Instance-Based Regression Using Stretched Neighborhoods

Jalali, Vahid (Indiana University) | Leake, David (Indiana University)

Instance-based regression methods generate solutions from prior solutions within a neighborhood of the input query. Their performance depends on both neighborhood selection criteria and on the method for generating new solutions from the values of prior instances. This paper proposes a new approach to addressing both problems, in which solutions are generated by an ensemble of solutions of local linear regression models built for a collection of "stretched" neighborhoods of the query. Each neighborhood is generated by relaxing a different dimension of the problem space. The rationale is to enable major change trends along that dimension to have increased influence on the corresponding model. The approach is evaluated for two candidate relaxation approaches, gradient-based and based on fixed profiles, and compared to baselines of k-NN and using a radius-based spherical neighborhood in n-dimensional space. Results in four test domains show up to 15 percent improvement over baselines, and suggest that the approach could be particularly useful in domains for which the space of prior instances is sparse.

ensemble approach, instance-based regression, stretched neighborhood

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.87)

Calix, Ricardo A. (Purdue University Calumet) | Sankaran, Rajesh (Argonne National Laboratory)

Feature Ranking and Support Vector Machines Classification Analysis of the NSL-KDD Intrusion Detection Corpus

feature ranking, nsl-kdd intrusion detection corpus, support vector machine classification analysis

Currently, signature based Intrusion Detection Systems (IDS) approaches are inadequate to address threats posed to networked systems by zero-day exploits. Statistical machine learning techniques offer a great opportunity to mitigate these threats. However, at this point, statistical based IDS systems are not mature enough to be implemented in realtime systems and the techniques to be used are not sufficiently understood. This study focuses on a recently expanded corpus for IDS analysis. Feature analysis and Support Vector Machines classification are performed to obtain a better understanding of the corpus and to establish a baseline set of results which can be used by other studies for comparison. Results of the classification and feature analysis are discussed.

Industry:

Information Technology > Security & Privacy (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.89)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)

Automatic Detection of Nominal Entities in Speech for Enriched Content Search

Calix, Ricardo A. (Purdue University Calumet) | Javadpout, Leili (Louisiana State University) | Khazaeli, Mehdi (Louisiana State University) | Knapp, Gerald M. (Louisiana State University)

In this work, a methodology is developed to detect sentient actors in spoken stories. Meta-tags are then saved to XML files associated with the audio files. A recursive approach is used to find actor candidates and features which are then classified using machine learning approaches. Results of the study indicate that the methodology performed well on a narrative based corpus of children’s stories. Using Support Vector Machines for classification, an F-measure accuracy score of 86% was achieved for both named and unnamed entities. Additionally, feature analysis indicated that speech features were very useful when detecting unnamed actors.

automatic detection, enriched content search, nominal entity, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.53)

Cluster Analysis for Optimal Indexing

Wylie, Tim (Montana State University) | Schuh, Michael A. (Montana State University) | Sheppard, John W. (Montana State University) | Angryk, Rafal A. (Montana State University)

High-dimensional indexing is an important area of current research, especially for range and kNN queries. This work introduces clustering for the sake of indexing. The goal is to develop new clustering methods designed to optimize the data partitioning for an indexing-specific tree structure instead of finding data distribution-based clusters. We focus on iDistance, a state-of-the-art high-dimensional indexing method, and take a basic approach to solving this new problem. By utilizing spherical clusters in an unsupervised Expectation Maximization algorithm dependent upon local density and cluster overlap, we create a partitioning of the space providing balanced segmentation for a B+-tree. We also look at the novel idea of reclustering for a specific indexing method by taking the output of one clustering method and reclustering it for use in an index. The algorithms are then tested and evaluated based on our error metric and iDistance query performance.

cluster analysis, optimal indexing

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)

Rosales-Pérez, Alejandro (Instituto Nacional de Astrofísica, Óptica y Electrónica) | Escalante, Hugo Jair (Instituto Nacional de Astrofísica, Óptica y Electrónica) | Gonzalez, Jesus A. (Instituto Nacional de Astrofísica, Óptica y Electrónica) | Reyes-Garcia, Carlos A. (Instituto Nacional de Astrofísica, Óptica y Electrónica)

Bias and Variance Optimization for SVMs Model Selection

Support vector machines (SVMs) are among the most used methods for pattern recognition. Acceptable results have been obtained with such methods in many domains and applications. However, as most learning algorithms, SVMs have hyperparameters that influence the effectiveness of the generated model. Thus, choosing adequate values for such hyperparameters is critical in order to obtain satisfactory results for a given classification task, a problem known as model selection. This paper introduces a novel model selection approach for SVMs based on multi-objective optimization and on the bias and variance definition. We propose an evolutionary algorithm that aims to select the configuration of hyperparameters that optimizes a trade-off between estimates of bias and variance; two factors that are closely related to the model accuracy and complexity. The proposed technique is evaluated using a suite of benchmark data sets for classification. Experimental results show the validity of our approach. We found that the model selection criteria resulted very helpful for selecting highly effective classification models.

bias and variance optimization, svm model selection

Genre: Research Report (0.53)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)

Green, Rachel M. (Johns Hopkins University / Onyx Point, Inc.) | Sheppard, John W. (Montana State University)

Comparing Frequency- and Style-Based Features for Twitter Author Identification

frequency-and style-based feature, twitter author identification

Author identification is a subfield of Natural Language Processing (NLP) that uses machine learning techniques to identify the author of a text. Most previous research focused on long texts with the assumption that a minimum text length threshold exists under which author identification would no longer be effective. This paper examines author identification in short texts far below this threshold, focusing on messages retrieved from Twitter (maximum length: 140 characters) to determine the most effective feature set for author identification. Both Bag-of-Words (BOW) and Style Marker feature sets were extracted and evaluated through a series of 15 experiments involving up to 12 authors with large and small dataset sizes. Support Vector Machines (SVM) were used for all experiments. Our results achieve classification accuracies approaching that of longer texts, even for small dataset sizes of 60 training instances per author. Style Marker feature sets were found to be significantly more useful than BOW feature sets as well as orders of magnitude faster, and are therefore suggested for potential applications in future research.

Genre: Research Report (0.53)

Industry: Information Technology > Services (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.53)

Sato, Issei, Tanaka, Shu, Kurihara, Kenichi, Miyashita, Seiji, Nakagawa, Hiroshi

Quantum Annealing for Dirichlet Process Mixture Models with Applications to Network Clustering

arXiv.org Machine LearningMay-19-2013

We developed a new quantum annealing (QA) algorithm for Dirichlet process mixture (DPM) models based on the Chinese restaurant process (CRP). QA is a parallelized extension of simulated annealing (SA), i.e., it is a parallel stochastic optimization technique. Existing approaches [Kurihara et al. UAI2009, Sato et al. UAI2009] and cannot be applied to the CRP because their QA framework is formulated using a fixed number of mixture components. The proposed QA algorithm can handle an unfixed number of classes in mixture models. We applied QA to a DPM model for clustering vertices in a network where a CRP seating arrangement indicates a network partition. A multi core processor was used for running QA in experiments, the results of which show that QA is better than SA, Markov chain Monte Carlo inference, and beam search at finding a maximum a posteriori estimation of a seating arrangement in the CRP. Since our QA algorithm is as easy as to implement the SA algorithm, it is suitable for a wide range of applications.

artificial intelligence, customer, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1016/j.neucom.2013.05.019

1305.4325

Country:

North America > United States (0.46)
Asia (0.29)

Genre: Research Report (0.82)

Industry: Consumer Products & Services > Restaurants (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)