Goto

Collaborating Authors

 Statistical Learning


Mathematical Language Processing: Automatic Grading and Feedback for Open Response Mathematical Questions

arXiv.org Machine Learning

While computer and communication technologies have provided effective means to scale up many aspects of education, the submission and grading of assessments such as homework assignments and tests remains a weak link. In this paper, we study the problem of automatically grading the kinds of open response mathematical questions that figure prominently in STEM (science, technology, engineering, and mathematics) courses. Our data-driven framework for mathematical language processing (MLP) leverages solution data from a large number of learners to evaluate the correctness of their solutions, assign partial-credit scores, and provide feedback to each learner on the likely locations of any errors. MLP takes inspiration from the success of natural language processing for text data and comprises three main steps. First, we convert each solution to an open response mathematical question into a series of numerical features. Second, we cluster the features from several solutions to uncover the structures of correct, partially correct, and incorrect solutions. We develop two different clustering approaches, one that leverages generic clustering algorithms and one based on Bayesian nonparametrics. Third, we automatically grade the remaining (potentially large number of) solutions based on their assigned cluster and one instructor-provided grade per cluster. As a bonus, we can track the cluster assignment of each step of a multistep solution and determine when it departs from a cluster of correct solutions, which enables us to indicate the likely locations of errors to learners. We test and validate MLP on real-world MOOC data to demonstrate how it can substantially reduce the human effort required in large-scale educational platforms.


Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

arXiv.org Machine Learning

We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning $(L/\mu)^2$ (where $L$ is a bound on the smoothness and $\mu$ on the strong convexity) to a linear dependence on $L/\mu$. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence in the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.


Semi-Supervised Sparse Coding

arXiv.org Machine Learning

Given a data sample with its feature vector, SC tries to learn a codebook with some codeworks, and approximate the data sample as the linear combination of the codewords. SC assume that only a few codewords in the codebook are enough to represent the data sample, thus the combination coefficients should be sparse, i.e. most of the coefficients are zeros, leaving only a few of them non-zeros. The linear combination coefficients of the data sample could be its new representation. Because they are sparse, the coefficient vector is often referred to as the sparse code. To solve the sparse code, one usually minimizes the approximation error with regard to the codebook and the sparse code, and at the same time seeks the sparsity of the sparse code. Although SC has been used in many pattern recognition applications, such as palmprint recognition [24], dynamic texture recognition [25], human action recognition [26], [27], [28], speech recognition [29], digit recognition [30], image annotation [31], [32], [33], and face recognition [34], in most cases, SC is used as an unsupervised learning method. When SC is performed to the training data set, it is assumed that the class labels of the training samples are unavailable. Then after the sparse codes are learned, they will be used to learn a classifier. Thus the class labels are ignored during the sparse coding procedure.


Prediction and Modularity in Dynamical Systems

arXiv.org Artificial Intelligence

Identifying and understanding modular organizations is centrally important in the study of complex systems. Several approaches to this problem have been advanced, many framed in information-theoretic terms. Our treatment starts from the complementary point of view of statistical modeling and prediction of dynamical systems. It is known that for finite amounts of training data, simpler models can have greater predictive power than more complex ones. We use the trade-off between model simplicity and predictive accuracy to generate optimal multiscale decompositions of dynamical networks into weakly-coupled, simple modules. State-dependent and causal versions of our method are also proposed.


The Fast Convergence of Incremental PCA

arXiv.org Machine Learning

We consider a situation in which we see samples in $\mathbb{R}^d$ drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both.


Bayesian Nonparametrics in Topic Modeling: A Brief Tutorial

arXiv.org Machine Learning

Using nonparametric methods has been increasingly explored in Bayesian hierarchical modeling as a way to increase model flexibility. Although the field shows a lot of promise, inference in many models, including Hierachical Dirichlet Processes (HDP), remain prohibitively slow. One promising path forward is to exploit the submodularity inherent in Indian Buffet Process (IBP) to derive near-optimal solutions in polynomial time. In this work, I will present a brief tutorial on Bayesian nonparametric methods, especially as they are applied to topic modeling. I will show a comparison between different non-parametric models and the current state-of-the-art parametric model, Latent Dirichlet Allocation (LDA).


Perfect Clustering for Stochastic Blockmodel Graphs via Adjacency Spectral Embedding

arXiv.org Machine Learning

In many problems arising in the natural sciences, technology, business and politics, it is crucial to understand the specific connections among the objects under study: for example, the interactions between members of a political party; the firing of synapses in a neuronal network; or citation patterns in reference literature. Mathematically, these objects and their connections are modeled as graphs, and a common goal is to find clusters of similar vertices within a graph. Both model-based and heuristic-based techniques have been proposed for clustering the vertices in a graphs [14, 2, 5, 19]. In this paper we focus on probabilistic performance guarantees for spectral-based techniques which 1 have elements of both model-and heuristic-based methods [18, 20]. We study the consistency of mean squared error clustering via the adjacency spectral embedding for three nested classes of models, each an examples of latent position models [7]: - the stochastic blockmodel where vertices in the same cluster are stochastically equivalent [8], - the degree-corrected stochastic blockmodel where stochastic equivalence holds up to a scaling factor [9], - and the random dot product graph where a natural vertex clustering may not exist [27].


Dirichlet Process Parsimonious Mixtures for clustering

arXiv.org Machine Learning

The parsimonious Gaussian mixture models, which exploit an eigenvalue decomposition of the group covariance matrices of the Gaussian mixture, have shown their success in particular in cluster analysis. Their estimation is in general performed by maximum likelihood estimation and has also been considered from a parametric Bayesian prospective. We propose new Dirichlet Process Parsimonious mixtures (DPPM) which represent a Bayesian nonparametric formulation of these parsimonious Gaussian mixture models. The proposed DPPM models are Bayesian nonparametric parsimonious mixture models that allow to simultaneously infer the model parameters, the optimal number of mixture components and the optimal parsimonious mixture structure from the data. We develop a Gibbs sampling technique for maximum a posteriori (MAP) estimation of the developed DPMM models and provide a Bayesian model selection framework by using Bayes factors. We apply them to cluster simulated data and real data sets, and compare them to the standard parsimonious mixture models. The obtained results highlight the effectiveness of the proposed nonparametric parsimonious mixture models as a good nonparametric alternative for the parametric parsimonious models.


An Improvement to the Domain Adaptation Bound in a PAC-Bayesian context

arXiv.org Machine Learning

This paper provides a theoretical analysis of domain adaptation based on the PAC-Bayesian theory. We propose an improvement of the previous domain adaptation bound obtained by Germain et al. [1] in two ways. We first give another generalization bound tighter and easier to interpret. Moreover, we provide a new analysis of the constant term appearing in the bound that can be of high interest for developing new algorithmic solutions.


Random Bits Regression: a Strong General Predictor for Big Data

arXiv.org Machine Learning

We are interested in a general data - based prediction task: g iven a train ing data matrix ( TrX), a training outcome vector ( TrY) and a test data matrix ( TeX), predict test outcome vector (). In the era of big data, two practically conflicting challenges are eminent: (1) the prior knowledge on the subject (a lso known as domain specific knowledge) is largely insufficient; (2) computation and storage cost of big data is unaffordable. To meet these aforementioned challenge s, this paper is devoted to modeling large number of observations without domain specific k nowledge, using regression and classification. The methods widely used for regression and classification can be classified as: linear regression, k nearest neighbor (KNN) [1], support vector machine (SVM) [2], neural network (NN) [3, 4], extreme learning machine (ELM) [5], deep learning (DL) [6], random forest (RF) [7] and boosting (GBM) [8] among others . Each method performs well on some types of datasets but has its own limitations on others [9 - 12] . A method with reasonable performance on boarder, if not universe, datasets is highly desired .