A Rule-Based Approach For Aligning Japanese-Spanish Sentences From A Comparable Corpora

arXiv.org Artificial Intelligence

The performance of a Statistical Machine Translation (SMT) system is directly proportional to the quality and length of the parallel corpus it uses. However, for some language pairs such corpora are severely lacking. Our long-term goal is to construct a Japanese-Spanish parallel corpus for SMT, since useful Japanese-Spanish parallel corpora are scarce. To address this problem, in this study we propose a method for extracting Japanese-Spanish parallel sentences from Wikipedia using POS tagging and a rule-based approach. The approach focuses mainly on the syntactic features of both languages. Human evaluation was performed on a sample and shows promising results in comparison with the baseline.


Bayesian nonparametric models for ranked data

arXiv.org Machine Learning

We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books.
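For readers unfamiliar with the base model, the following is a minimal sketch of the standard finite Plackett-Luce likelihood only, not the nonparametric extension or the Gibbs sampler described in the abstract. The item names and weights are made up for illustration. A ranking is built by repeatedly choosing the next item with probability proportional to its weight among the items not yet ranked.

```python
import math
from itertools import permutations

def plackett_luce_log_prob(ranking, weights):
    """Log-probability of `ranking` (best item first) under a finite
    Plackett-Luce model with positive item weights."""
    # Total weight of the items that are still unranked.
    remaining = sum(weights[item] for item in ranking)
    log_p = 0.0
    for item in ranking:
        # The next item is chosen proportionally to its weight.
        log_p += math.log(weights[item]) - math.log(remaining)
        remaining -= weights[item]
    return log_p

# Hypothetical weights for three items (illustrative values only).
weights = {"a": 3.0, "b": 2.0, "c": 1.0}

# Probabilities over all full rankings should sum to one.
total = sum(math.exp(plackett_luce_log_prob(p, weights))
            for p in permutations(weights))
```

With these weights, the ranking (a, b, c) has probability 3/6 · 2/3 · 1/1 = 1/3.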


A Dataset for StarCraft AI & an Example of Armies Clustering

arXiv.org Artificial Intelligence

This paper advocates the exploration of the full state of recorded real-time strategy (RTS) games, by human or robotic players, to discover how to reason about tactics and strategy. We present a dataset of StarCraft games encompassing most of the game state (not only players' orders). We explain one of the possible uses of this dataset by clustering armies on their compositions. This reduction of army compositions to mixtures of Gaussians allows for strategic reasoning at the level of the components. We evaluated this clustering method by predicting the outcomes of battles based on the mixture components of army compositions.


Cost-sensitive C4.5 with post-pruning and competition

arXiv.org Artificial Intelligence

Decision trees are an effective classification approach in data mining and machine learning. In applications, test costs and misclassification costs should be considered while inducing decision trees. Recently, some cost-sensitive learning algorithms based on ID3, such as CS-ID3, IDX, and λ-ID3, have been proposed to deal with the issue. These algorithms deal with only symbolic data. In this paper, we develop a decision tree algorithm inspired by C4.5 for numeric data. Our algorithm addresses two major issues. First, we develop the test-cost-weighted information gain ratio as the heuristic information. According to this heuristic information, our algorithm picks the attribute that provides a higher gain ratio and costs less at each selection. Second, we design a post-pruning strategy by considering the tradeoff between the test costs and misclassification costs of the generated decision tree. In this way, the total cost is reduced. Experimental results indicate that (1) our algorithm is stable and effective; (2) the post-pruning technique reduces the total cost significantly; (3) the competition strategy is effective in obtaining a cost-sensitive decision tree with low cost.


Data Clustering via Principal Direction Gap Partitioning

arXiv.org Machine Learning

Data clustering has applications in a wide variety of fields, ranging from the social and biological sciences to business, statistics, information retrieval, machine learning and data mining. Clustering refers to the process of grouping data based only on information found in the data which describes its characteristics and relationships. Although humans are generally very good at discovering patterns and classifying objects, clustering algorithms are able to discern similarities in data even when humans are not [6]. The main focus of our research has been document clustering, but we will demonstrate that our methods also work nicely on scientific data. In this paper, we propose an adaptation of the clustering algorithm known as Principal Direction Divisive Partitioning (PDDP), developed by Daniel Boley in [2], which is based on Principal Components Analysis (PCA). PCA involves the eigenvector decomposition of a data covariance matrix, or equivalently a singular value decomposition (SVD) of a data matrix after mean centering. The name of our adaptation, Principal Direction Gap Partitioning (PDGP), borrows most of its name from PDDP, as it follows many of the same steps. The word "gap" replaces the word "divisive" in reference to how the algorithm splits data along natural gaps at each step. This concept will be further developed in the following sections, but it should be noted that PDGP is still a divisive algorithm in the same way that PDDP is.
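The single split step described above can be sketched as follows, under stated assumptions. The data are mean-centered, the leading principal direction is estimated by power iteration on the scatter matrix (a stand-in for a full SVD), and each point is projected onto that direction. `sign_split` follows the usual PDDP rule of splitting at zero. `gap_split` is only one plausible reading of "splitting along natural gaps" (cut at the largest gap between sorted projections); the abstract does not specify the authors' actual rule, so treat it as hypothetical. The toy points are invented.

```python
import math

def principal_projections(rows, iters=200):
    """Project mean-centered rows onto their leading principal direction."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Power iteration on X^T X; assumes the start vector is not
    # orthogonal to the leading direction.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return [sum(x[j] * v[j] for j in range(d)) for x in centered]

def sign_split(proj):
    """PDDP-style split: partition by the sign of the projection."""
    left = [i for i, p in enumerate(proj) if p <= 0.0]
    right = [i for i, p in enumerate(proj) if p > 0.0]
    return left, right

def gap_split(proj):
    """Hypothetical gap rule: cut at the largest gap between sorted projections."""
    order = sorted(range(len(proj)), key=lambda i: proj[i])
    gaps = [proj[order[k + 1]] - proj[order[k]] for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return order[:cut], order[cut:]

# Two invented, well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1),
          (5.0, 0.0), (5.2, 0.1), (5.1, -0.1)]
proj = principal_projections(points)
by_sign = sign_split(proj)
by_gap = gap_split(proj)
```

On balanced, well-separated data like this the two rules agree; they can differ when the groups have unequal sizes, because the mean then no longer falls inside the gap.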


A Logic and Adaptive Approach for Efficient Diagnosis Systems using CBR

arXiv.org Artificial Intelligence

Case-Based Reasoning (CBR) is an intelligent way of thinking based on experience and on the capitalization of already solved cases (source cases) to find a solution to a new problem (target case). The retrieval phase consists of identifying source cases that are similar to the target case. This phase may lead to erroneous results if existing knowledge imperfections are not taken into account. This work presents a novel solution based on fuzzy logic techniques and adaptation measures which aggregate weighted similarities to improve the retrieval results. To confirm the efficiency of our solution, we have applied it to the industrial diagnosis domain. The results obtained are more efficient than those obtained by applying typical measures.


Automated Feedback Generation for Introductory Programming Assignments

arXiv.org Artificial Intelligence

We present a new method for automatically providing feedback for introductory programming problems. In order to use this method, we need a reference implementation of the assignment, and an error model consisting of potential corrections to errors that students might make. Using this information, the system automatically derives minimal corrections to students' incorrect solutions, providing them with a quantifiable measure of exactly how incorrect a given solution was, as well as feedback about what they did wrong. We introduce a simple language for describing error models in terms of correction rules, and formally define a rule-directed translation strategy that reduces the problem of finding minimal corrections in an incorrect program to the problem of synthesizing a correct program from a sketch. We have evaluated our system on thousands of real student attempts obtained from 6.00 and 6.00x. Our results show that relatively simple error models can correct on average 65% of all incorrect submissions.


SERAPH: Semi-supervised Metric Learning Paradigm with Hyper Sparsity

arXiv.org Machine Learning

We propose a general information-theoretic approach called Seraph (SEmi-supervised metRic leArning Paradigm with Hyper-sparsity) for metric learning that does not rely upon the manifold assumption. Given the probability parameterized by a Mahalanobis distance, we maximize the entropy of that probability on labeled data and minimize it on unlabeled data following entropy regularization, which allows the supervised and unsupervised parts to be integrated in a natural and meaningful way. Furthermore, Seraph is regularized by encouraging a low-rank projection induced from the metric. The optimization of Seraph is solved efficiently and stably by an EM-like scheme with the analytical E-Step and convex M-Step. Experiments demonstrate that Seraph compares favorably with many well-known global and local metric learning methods.


Sequence Transduction with Recurrent Neural Networks

arXiv.org Machine Learning

Many machine learning tasks can be expressed as the transformation, or transduction, of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech, to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating. Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However, RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation, since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging. This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence. Experimental results for phoneme recognition are provided on the TIMIT speech corpus.


Spectral Clustering: An empirical study of Approximation Algorithms and its Application to the Attrition Problem

arXiv.org Machine Learning

Spectral clustering is a now well-known clustering method which uses the spectrum of the data similarity matrix to separate the data. Since the method relies on solving an eigenvector problem, it is computationally expensive for large datasets. To overcome this constraint, approximation methods have been developed which aim to reduce running time while maintaining accurate classification. In this article, we summarize and experimentally evaluate several approximation methods for spectral clustering. From an applications standpoint, we employ spectral clustering to solve the so-called attrition problem, where one aims to distinguish, within a set of employees, those who are likely to voluntarily leave the company from those who are not. Our study sheds light on the empirical performance of existing approximate spectral clustering methods and shows the applicability of these methods to an important business optimization problem. Clustering, or cluster analysis, addresses the problem of separating a set of objects into clusters so that objects within each cluster are more similar to each other than to objects in different clusters. The clustering problem has become ubiquitous in data mining and machine learning, with applications ranging from image processing to bioinformatics. What one means by clustering, and the type of clustering desired, is application dependent. For example, one may wish to segment an image such as that in Figure 1 (a)-(b). In medical imaging, segmentation may aid in the identification of tumors, assist physicians in surgery and separate anatomical structures. Computer vision applications use clustering methods to identify foreign objects in surveillance images or detect road signs for computer-piloted vehicles. In statistical analysis, the objects to be clustered may represent individuals in a population, each viewed as a vector of personal attributes. For example, we will consider the attrition problem: from a dataset of employees, one wishes to identify which cluster of employees is likely to voluntarily leave the company and which is not.
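As a rough illustration of the exact (non-approximate) method that the surveyed approximations aim to speed up, here is a minimal two-cluster sketch using only the Python standard library. It makes several assumptions not taken from the article: a Gaussian similarity with a hypothetical bandwidth `sigma`, the symmetrically normalized similarity matrix, and power iteration with deflation in place of a real eigensolver. The toy points are invented. Building and multiplying the dense n-by-n matrix is the step that becomes expensive for large datasets.

```python
import math
import random

def spectral_bipartition(points, sigma=1.0, iters=500, seed=0):
    """Split points into two clusters by the sign of the second
    eigenvector of the normalized similarity matrix."""
    n = len(points)
    # Dense Gaussian similarity matrix W (O(n^2) memory and time).
    W = [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * sigma ** 2))
          for q in points] for p in points]
    deg = [sum(row) for row in W]
    # M = D^{-1/2} W D^{-1/2}; the normalized Laplacian is I - M.
    M = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # Known leading eigenvector of M (eigenvalue 1): D^{1/2} 1, normalized.
    top = [math.sqrt(d) for d in deg]
    top_norm = math.sqrt(sum(t * t for t in top))
    top = [t / top_norm for t in top]

    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Deflate: remove the component along the leading eigenvector.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        # One power-iteration step, then renormalize.
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return [1 if a > 0.0 else 0 for a in v]

# Two invented, well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1),
          (5.0, 0.0), (5.2, 0.1), (5.1, -0.1)]
labels = spectral_bipartition(points)
```

The label values themselves are arbitrary (which group gets 0 or 1 depends on the sign of the eigenvector); only the grouping is meaningful.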