 Bayesian Learning


Oriented and Degree-generated Block Models: Generating and Inferring Communities with Inhomogeneous Degree Distributions

arXiv.org Machine Learning

The stochastic block model is a powerful tool for inferring community structure from network topology. However, it predicts a Poisson degree distribution within each community, while most real-world networks have a heavy-tailed degree distribution. The degree-corrected block model can accommodate arbitrary degree distributions within communities. But since it takes the vertex degrees as parameters rather than generating them, it cannot use them to help it classify the vertices, and its natural generalization to directed graphs cannot even use the orientations of the edges. In this paper, we present variants of the block model with the best of both worlds: they can use vertex degrees and edge orientations in the classification process, while tolerating heavy-tailed degree distributions within communities. We show that for some networks, including synthetic networks and networks of word adjacencies in English text, these new block models achieve a higher accuracy than either standard or degree-corrected block models.
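As a minimal sketch of the standard (undirected, non-degree-corrected) stochastic block model mentioned above, the following samples a graph from block sizes and a block-to-block edge probability matrix. The function names `sample_sbm` and `degrees` and the parameter layout are illustrative assumptions, not the paper's code. Each vertex's degree is a sum of independent Bernoulli trials, which is why degrees within a community are approximately Poisson in large sparse graphs.

```python
import random

def sample_sbm(sizes, p, seed=0):
    """Sample an undirected stochastic block model (illustrative sketch).

    sizes: number of vertices in each block.
    p:     p[r][s] is the probability of an edge between a vertex in
           block r and a vertex in block s (assumed symmetric).
    """
    rng = random.Random(seed)
    # block[u] is the community label of vertex u
    block = [r for r, size in enumerate(sizes) for _ in range(size)]
    n = len(block)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):  # each unordered pair once, no self-loops
            if rng.random() < p[block[u]][block[v]]:
                edges.append((u, v))
    return block, edges

def degrees(n, edges):
    """Return the degree of each of the n vertices."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg
```

With `p = [[1, 0], [0, 1]]` the sample is deterministic: every block becomes a clique and no edges cross blocks, which is a convenient sanity check.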


Finding Important Genes from High-Dimensional Data: An Appraisal of Statistical Tests and Machine-Learning Approaches

arXiv.org Machine Learning

Over the past decades, statisticians and machine-learning researchers have developed literally thousands of new tools for the reduction of high-dimensional data in order to identify the variables most responsible for a particular trait. These tools have applications in a plethora of settings, including data analysis in the fields of business, education, forensics, and biology (such as microarray, proteomics, brain imaging), to name a few. In the present work, we focus our investigation on the limitations and potential misuses of certain tools in the analysis of the benchmark colon cancer data (2,000 variables; Alon et al., 1999) and the prostate cancer data (6,033 variables; Efron, 2010, 2008). Our analysis demonstrates that models that produce 100% accuracy measures often select different sets of genes and cannot stand the scrutiny of parameter estimates and model stability. Furthermore, we created a host of simulation datasets and "artificial diseases" to evaluate the reliability of commonly used statistical and data mining tools. We found that certain widely used models can classify the data with 100% accuracy without using any of the variables responsible for the disease. With moderate sample size and suitable pre-screening, stochastic gradient boosting will be shown to be a superior model for gene selection and variable screening from high-dimensional datasets.


Solving Limited Memory Influence Diagrams

Journal of Artificial Intelligence Research

We present a new algorithm for exactly solving decision making problems represented as influence diagrams. We do not require the usual assumptions of no forgetting and regularity; this allows us to solve problems with simultaneous decisions and limited information. The algorithm is empirically shown to outperform a state-of-the-art algorithm on randomly generated problems of up to 150 variables and 10^64 solutions. We show that these problems are NP-hard even if the underlying graph structure of the problem has low treewidth and the variables take on a bounded number of states, and that they admit no provably good approximation if variables can take on an arbitrary number of states.


Tutor Modeling Versus Student Modeling

AAAI Conferences

The current paradigm in student modeling has continued to show the power of its simplifying assumption of knowledge as a binary and monotonically increasing construct, the value of which directly causes the outcome of student answers to questions. Recent efforts have focused on optimizing the prediction accuracy of responses to questions using student models. Incorporating individual student parameter interactions has been an interpretable and principled approach which has improved the performance of this task, as demonstrated by its application in the 2010 KDD Cup challenge on Educational Data. Performance prediction, however, can have limited practical utility. The greatest utility of such student models can be their ability to model the tutor and the attributes of the tutor which are causing learning. Harnessing the same simplifying assumption of learning used in student modeling, we can turn this model on its head to effectively tease out the tutor attributes causing learning and begin to optimize the tutor model to benefit the student model.


Classifying Scientific Performance on a Metric-by-Metric Basis

AAAI Conferences

In this paper, we outline a system for evaluating the performance of scientific research across a number of outcome metrics (e.g. publications, sales, new hires). Our system is designed to classify research performance into a number of metrics, evaluate each metric’s performance using only data on other metrics, and cast predictions of future performance by metric. This study shows how data mining techniques can be used to provide a predictive analytic approach to the management of resources for scientific research.


Customizing Question Selection in Conversational Case-Based Reasoning

AAAI Conferences

Conversational case-based reasoning systems use an interactive dialog to retrieve stored cases. Normally the ordering of questions in this dialog is chosen based only on their discriminativeness. However, because the user may not be able to answer all questions, even highly discriminative questions are not guaranteed to provide information. This paper presents a customization method CCBR systems can apply to adjust entropy-based discriminativeness considerations by predictions of user ability to answer questions. The method uses a naive Bayesian classifier to classify users into user groups based on the questions they answer, applies information from group profiles to predict which future questions they are likely to be able to answer, and selects the next questions to ask based on a combination of information gain and response likelihood. The method was evaluated for a mix of simulated user groups, each associated with particular probabilities for answering questions about each case indexing feature, in four sample domains. For simulated users with varying abilities to answer particular questions, results showed improvement in dialog length over a non-customized entropy-based approach in all test domains.
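The selection step described above can be sketched as follows. The abstract only says that information gain and response likelihood are combined; multiplying them (an expected-gain score) is an assumption made here for illustration, and `pick_question`, `p_answer`, and the case layout are hypothetical names, not the paper's.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of case labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(cases, question):
    """Entropy reduction from knowing the answer to `question`.

    cases: list of (label, answers) pairs, where answers maps
           question -> answer value.
    """
    base = entropy([label for label, _ in cases])
    groups = {}
    for label, answers in cases:
        groups.setdefault(answers[question], []).append(label)
    remainder = sum(len(g) / len(cases) * entropy(g) for g in groups.values())
    return base - remainder

def pick_question(cases, questions, p_answer):
    """Pick the question with the highest gain weighted by the predicted
    probability that the user can answer it (assumed combination rule)."""
    return max(questions,
               key=lambda q: information_gain(cases, q) * p_answer[q])
```

A highly discriminative question can lose to a weaker one when the user group is unlikely to answer it, which is the behavior the abstract motivates.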


Identifying Personality Types Using Document Classification Methods

AAAI Conferences

Are the words that people use indicative of their personality type preferences? In this paper, it is hypothesized that word-usage is not independent of personality type, as measured by the Myers-Briggs Type Indicator (MBTI) personality assessment tool. In-class writing samples were taken from 40 graduate students along with the MBTI. The experiment utilizes naïve Bayes classifiers and Support Vector Machines (SVMs) in an attempt to guess an individual’s personality type based on their word-choice. Classification is also attempted using emotional, social, cognitive, and psychological dimensions elicited by the analysis software, Linguistic Inquiry and Word Count (LIWC). The classifiers are evaluated with 40 distinct trials (leave-one-out cross validation), and parameters are chosen using leave-one-out cross validation of each trial’s training set. The experiment showed that the naïve Bayes classifiers (word-based and LIWC-based) outperformed the SVMs when guessing Sensing-Intuition (S-N) and Thinking-Feeling (T-F).
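The evaluation protocol above (leave-one-out trials, with parameters chosen by leave-one-out on each trial's training set) can be sketched as a nested split generator. This is only an outline of the index bookkeeping under assumed names (`loo_splits`, `nested_loo`); it does not train any classifier.

```python
def loo_splits(items):
    """Yield (training_items, held_out_item) for leave-one-out validation."""
    for i in range(len(items)):
        yield items[:i] + items[i + 1:], items[i]

def nested_loo(items):
    """For each outer trial, also build the inner leave-one-out splits of
    that trial's training set, which would be used for parameter selection."""
    for train, held_out in loo_splits(items):
        inner = list(loo_splits(train))
        yield train, held_out, inner
```

With 40 samples this gives 40 outer trials, each with 39 inner splits, so the held-out sample never influences the parameters used to classify it.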


Real-Time Filtering for Pulsing Public Opinion in Social Media

AAAI Conferences

When analysing social media conversations, in search of the public opinion about an unfolding event that is being discussed in real-time (e.g., presidential debates, major speeches, etc.), it is important to distinguish between two groups of participants: opinion-makers and opinion-holders. To address this problem, we propose a supervised machine-learning approach, which uses inexpensively acquired labeled data from monothematic Twitter accounts to learn a binary classifier for the labels “political account” (opinion-makers) and “non-political account” (opinion-holders). While the classifier has an 83% accuracy on individual tweets, when applied to the last 200 tweets from accounts of a set of 1000 Twitter users, it classifies accounts with 97% accuracy. This high accuracy derives from our decision to incorporate information about classifier probability into the classification. Our work demonstrates that machine learning algorithms can play a critical role in improving the quality of social media analytics and understanding, whose importance is increasing as social media adoption becomes widespread.
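One plausible way to use per-tweet classifier probabilities at the account level is to sum their log-odds rather than count hard votes. The abstract does not specify the aggregation rule, so the sketch below is an assumption for illustration; `classify_account` and `majority_vote` are hypothetical names.

```python
import math

def classify_account(tweet_probs, eps=1e-6):
    """Label an account from per-tweet probabilities of 'political'.

    Sums log-odds so that confident tweets count for more than
    borderline ones (assumed aggregation rule, not the paper's).
    """
    log_odds = 0.0
    for p in tweet_probs:
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        log_odds += math.log(p / (1 - p))
    return "political" if log_odds > 0 else "non-political"

def majority_vote(tweet_probs):
    """Baseline: threshold each tweet at 0.5 and count votes."""
    votes = sum(1 for p in tweet_probs if p > 0.5)
    return "political" if votes > len(tweet_probs) / 2 else "non-political"
```

For probabilities such as `[0.95, 0.97, 0.45, 0.45, 0.45]`, the vote baseline is swayed by three borderline tweets while the log-odds sum is dominated by the two confident ones, so the two rules disagree.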


Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector Correlation

arXiv.org Machine Learning

This work discusses the problem of sparse signal recovery when there is correlation among the values of non-zero entries. We examine intra-vector correlation in the context of the block sparse model and inter-vector correlation in the context of the multiple measurement vector model, as well as their combination. Algorithms based on sparse Bayesian learning are presented and the benefits of incorporating correlation at the algorithm level are discussed. The impact of correlation on the limits of support recovery is also discussed, highlighting the different impact intra-vector and inter-vector correlations have on such limits.


Efficient Methods for Unsupervised Learning of Probabilistic Models

arXiv.org Artificial Intelligence

Interpreting neural spike trains, compressing video, identifying features in DNA microarrays, and recognizing particles in high energy physics all rely upon the ability to find and model complex structure in a high dimensional space. Despite their great promise, high dimensional probabilistic models are frequently computationally intractable to work with in practice. In this thesis I develop solutions to overcome this intractability, primarily in the context of energy based models. A common cause of intractability is that model distributions cannot be analytically normalized. Probabilities can only be computed up to a constant, making training exceedingly difficult. To solve this problem I propose 'minimum probability flow learning', a variational technique for parameter estimation in such models.