Statistical Learning


Classifying Scientific Performance on a Metric-by-Metric Basis

AAAI Conferences

In this paper, we outline a system for evaluating the performance of scientific research across a number of outcome metrics (e.g., publications, sales, new hires). Our system is designed to classify research performance into a number of metrics, to evaluate each metric’s performance using only data on other metrics, and to cast predictions of future performance by metric. This study shows how data mining techniques can be used to provide a predictive analytic approach to the management of resources for scientific research.


Quantitative Comparison of Linear and Non-linear Dimensionality Reduction Techniques for Solar Image Archives

AAAI Conferences

This work investigates the applicability of several dimensionality reduction techniques for large scale solar data analysis. Using the first solar domain-specific benchmark dataset that contains images of multiple types of phenomena, we investigate linear and non-linear dimensionality reduction methods in order to reduce our storage costs and maintain an accurate representation of our data in a new vector space. We present a comparative analysis between several dimensionality reduction methods and different numbers of target dimensions by utilizing different classifiers in order to determine the percentage of dimensionality reduction that can be achieved on solar data with said methods, and to discover the method that is the most effective for solar images.


Identifying Personality Types Using Document Classification Methods

AAAI Conferences

Are the words that people use indicative of their personality type preferences? In this paper, it is hypothesized that word-usage is not independent of personality type, as measured by the Myers-Briggs Type Indicator (MBTI) personality assessment tool. In-class writing samples were taken from 40 graduate students along with the MBTI. The experiment utilizes naïve Bayes classifiers and Support Vector Machines (SVMs) in an attempt to guess an individual’s personality type based on their word-choice. Classification is also attempted using emotional, social, cognitive, and psychological dimensions elicited by the analysis software, Linguistic Inquiry and Word Count (LIWC). The classifiers are evaluated with 40 distinct trials (leave-one-out cross validation), and parameters are chosen using leave-one-out cross validation of each trial’s training set. The experiment showed that the naïve Bayes classifiers (word-based and LIWC-based) outperformed the SVMs when guessing Sensing-Intuition (S-N) and Thinking-Feeling (T-F).
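
The evaluation protocol described above (an outer leave-one-out loop for scoring, with parameters chosen by a second leave-one-out pass over each trial's training set) can be sketched as follows. This is a minimal illustration, not the authors' code: the one-dimensional nearest-neighbour classifier is a stand-in for the naïve Bayes and SVM models, and the names `knn_predict`, `loo_accuracy`, and `nested_loo` are hypothetical.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Stand-in classifier (assumption): majority label among the k nearest 1-D points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(xs, ys, k):
    """Leave-one-out accuracy of the stand-in classifier for one value of k."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)


def nested_loo(xs, ys, k_values):
    """Outer leave-one-out for evaluation; inner leave-one-out on each training set picks k."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        # The held-out sample never influences the parameter choice.
        best_k = max(k_values, key=lambda k: loo_accuracy(train_x, train_y, k))
        hits += knn_predict(train_x, train_y, xs[i], best_k) == ys[i]
    return hits / len(xs)
```

With 40 samples, the outer loop runs 40 trials, and each trial tunes its parameter on the remaining 39 samples only, so the reported accuracy is not biased by the tuning step.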


Proper Noun Semantic Clustering Using Bag-of-Vectors

AAAI Conferences

In this paper, we propose a model for semantic clustering of entities extracted from a text, and we apply it to a Proper Noun classification task. This model is based on a new method to compute the similarity between the entities. Indeed, the classical way of calculating similarity is to build a feature vector or Bag-of-Features for each entity and then use classical similarity functions like Cosine. In practice, the features are contextual, such as words around the different occurrences of each entity. Here, we propose to use an alternative representation for entities, called Bag-of-Vectors, or Bag-of-Bags-of-Features. In this new model, each entity is not defined as a unique vector but as a set of vectors, in which each vector is built based on the contextual features of one occurrence of the entity. In order to use Bag-of-Vectors for clustering, we introduce new versions of classical similarity functions such as Cosine and Scalar Products. Experimentally, we show that the Bag-of-Vectors representation always improves the clustering results compared to classical Bag-of-Features representations.
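
To make the contrast between the two representations concrete, the sketch below compares a single summed vector per entity (Bag-of-Features) with a set of per-occurrence vectors (Bag-of-Vectors). The abstract does not specify how the new similarity functions aggregate over occurrences, so the mean pairwise cosine used in `bov_cosine` is an assumption made for illustration, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bag_of_features(occurrences):
    """Classical representation: sum all occurrence vectors into one vector."""
    total = {}
    for occ in occurrences:
        for f, w in occ.items():
            total[f] = total.get(f, 0.0) + w
    return total


def bov_cosine(occs_a, occs_b):
    """Bag-of-Vectors similarity; the mean pairwise cosine is an assumed aggregation."""
    if not occs_a or not occs_b:
        return 0.0
    total = sum(cosine(a, b) for a in occs_a for b in occs_b)
    return total / (len(occs_a) * len(occs_b))
```

For example, an entity seen once near "capital, france" and once near "hilton, hotel" keeps those two contexts separate under Bag-of-Vectors, whereas Bag-of-Features merges them into one vector before any comparison is made.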


Syntagmatic, Paradigmatic, and Automatic N-Gram Approaches to Assessing Essay Quality

AAAI Conferences

Computational indices related to n-gram production were developed in order to assess the potential for n-gram indices to predict human scores of essay quality. A regression analysis was conducted on a corpus of 313 argumentative essays. The analysis demonstrated that a variety of n-gram indices were highly correlated to essay quality, but were also highly correlated to the number of words in the text (although many of the n-gram indices were stronger predictors of writing quality than the number of words in a text). A second regression analysis was conducted on a corpus of 88 argumentative essays that were controlled for text length differences. This analysis demonstrated that n-gram indices were still strong predictors of essay quality when text length was not a factor.


Emotion Expression 3-D Synthesis From Predicted Emotion Magnitudes

AAAI Conferences

Many studies have been conducted on how to detect emotion classes or magnitudes from multimedia information such as text, audio, and images. However, the methods that can use predicted emotion classes and magnitudes to render emotion expressions in Embodied Conversational Agents (ECA) are still unclear. This paper proposes a computer graphics methodology that uses predicted non-linear regression values to render facial expressions using mesh morphing techniques. Results of the rendering technique are presented and discussed.


Evolving Kernel Functions with Particle Swarms and Genetic Programming

AAAI Conferences

The Support Vector Machine has gained significant popularity over recent years as a kernel-based supervised learning technique. However, choosing the appropriate kernel function and its associated parameters is not a trivial task. The kernel is often chosen from several widely-used and general-purpose functions, and the parameters are then empirically tuned for the best results on a specific data set. This paper explores the use of Particle Swarm Optimization and Genetic Programming as evolutionary approaches to evolve effective kernel functions for a given dataset. Rather than using expert knowledge, we evolve kernel functions without human-guided knowledge or intuition. Our results show consistently better SVM performance with evolved kernels over a variety of traditional kernels on several datasets.
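
One reason kernels lend themselves to evolutionary search is that sums and products of valid kernels are themselves valid kernels, so candidate functions can be assembled from simple building blocks. The sketch below shows only that composition step; it is not the paper's algorithm, it includes no Particle Swarm or Genetic Programming search, and the helper names (`linear`, `rbf`, `add`, `mul`, `gram`) are hypothetical.

```python
import math


def linear(x, y):
    """Linear kernel: the ordinary dot product."""
    return sum(a * b for a, b in zip(x, y))


def rbf(gamma):
    """Return an RBF kernel with the given width parameter gamma."""
    def k(x, y):
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
    return k


def add(k1, k2):
    """Sum of two kernels (closure property: the result is again a kernel)."""
    return lambda x, y: k1(x, y) + k2(x, y)


def mul(k1, k2):
    """Product of two kernels (closure property: the result is again a kernel)."""
    return lambda x, y: k1(x, y) * k2(x, y)


def gram(kernel, points):
    """Gram matrix of a kernel over a list of points."""
    return [[kernel(p, q) for q in points] for p in points]


# One candidate an evolutionary search might produce: rbf + linear * linear.
candidate = add(rbf(0.5), mul(linear, linear))
```

A search procedure would score each candidate by training an SVM with its Gram matrix and measuring validation accuracy; that fitness loop is omitted here.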


Efficient Methods for Unsupervised Learning of Probabilistic Models

arXiv.org Artificial Intelligence

Interpreting neural spike trains, compressing video, identifying features in DNA microarrays, and recognizing particles in high energy physics all rely upon the ability to find and model complex structure in a high dimensional space. Despite their great promise, high dimensional probabilistic models are frequently computationally intractable to work with in practice. In this thesis I develop solutions to overcome this intractability, primarily in the context of energy based models. A common cause of intractability is that model distributions cannot be analytically normalized. Probabilities can only be computed up to a constant, making training exceedingly difficult. To solve this problem I propose 'minimum probability flow learning', a variational technique for parameter estimation in such models.


Challenges and Opportunities in Applied Machine Learning

AI Magazine

Machine learning research is often conducted in vitro, divorced from motivating practical applications. A researcher might develop a new method for the general task of classification, then assess its utility by comparing its performance (such as accuracy or AUC) to that of existing classification models on publicly available datasets. In terms of advancing machine learning as an academic discipline, this approach has thus far proven quite fruitful. However, it is our view that the most interesting open problems in machine learning are those that arise during its application to real-world problems. We illustrate this point by reviewing two of our interdisciplinary collaborations, both of which have posed unique machine learning problems, providing fertile ground for novel research.


A Discussion on Parallelization Schemes for Stochastic Vector Quantization Algorithms

arXiv.org Machine Learning

This paper studies parallelization schemes for stochastic Vector Quantization algorithms in order to obtain speed-ups using distributed resources. We show that the most intuitive parallelization scheme does not lead to better performance than the sequential algorithm. Another distributed scheme is therefore introduced, which obtains the expected speed-ups. It is then improved to fit implementation on distributed architectures where communications are slow and inter-machine synchronization is too costly. The schemes are tested with simulated distributed architectures and, for the last one, on the Microsoft Windows Azure platform, obtaining speed-ups with up to 32 Virtual Machines.
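
For readers unfamiliar with the sequential baseline, a stochastic Vector Quantization step draws one sample, finds its nearest prototype, and moves that prototype toward the sample. The sketch below covers only this sequential loop; the parallel schemes are not described in the abstract and are not reproduced here. The step size `1 / (t + 1)` and the function names are assumptions chosen for illustration.

```python
import random


def nearest(prototypes, x):
    """Index of the prototype closest to x in squared Euclidean distance."""
    return min(
        range(len(prototypes)),
        key=lambda j: sum((p - xi) ** 2 for p, xi in zip(prototypes[j], x)),
    )


def stochastic_vq(data, prototypes, steps, seed=0):
    """Sequential stochastic VQ: one random sample and one prototype update per step."""
    rng = random.Random(seed)
    w = [list(p) for p in prototypes]  # copy so the caller's prototypes are untouched
    for t in range(1, steps + 1):
        x = rng.choice(data)
        j = nearest(w, x)
        eps = 1.0 / (t + 1)  # assumed decreasing step size
        w[j] = [p + eps * (xi - p) for p, xi in zip(w[j], x)]
    return w


def distortion(data, prototypes):
    """Mean squared distance from each point to its nearest prototype."""
    total = 0.0
    for x in data:
        j = nearest(prototypes, x)
        total += sum((p - xi) ** 2 for p, xi in zip(prototypes[j], x))
    return total / len(data)
```

Because every update depends on the prototypes left by the previous one, the loop is inherently sequential, which is why distributing it across machines requires a deliberate scheme for combining the workers' updates.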