Solving Arithmetic Word Problems Automatically Using Transformer and Unambiguous Representations

arXiv.org Machine Learning

Constructing accurate and automatic solvers of math word problems has proven to be quite challenging. Prior attempts using machine learning have been trained on corpora specific to math word problems to produce arithmetic expressions in infix notation before answer computation. We find that custom-built neural networks have struggled to generalize well. This paper outlines the use of Transformer networks trained to translate math word problems to equivalent arithmetic expressions in infix, prefix, and postfix notations. In addition to training directly on domain-specific corpora, we use an approach that pre-trains on a general text corpus to provide foundational language abilities to explore if it improves performance. We compare results produced by a large number of neural configurations and find that most configurations outperform previously reported approaches on three of four datasets with significant increases in accuracy of over 20 percentage points. The best neural approaches boost accuracy by almost 10% on average when compared to the previous state of the art.
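To illustrate the three notations named in the abstract, the following is a minimal sketch that converts a tokenized infix expression to postfix and prefix form and evaluates the result. It is not the paper's pipeline: the function names, the four supported operators, and the example expression are assumptions made for illustration. The conversion uses the standard shunting-yard procedure.

```python
import operator

# Supported binary operators (an assumption for this sketch); all left-associative.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def infix_to_postfix(tokens):
    """Convert a list of infix tokens to postfix using shunting-yard."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators of higher or equal precedence first.
            while (stack and stack[-1] in PRECEDENCE
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the "("
        else:
            output.append(tok)  # operand
    while stack:
        output.append(stack.pop())
    return output


def postfix_to_prefix(postfix):
    """Rebuild prefix order from postfix by stacking sub-expressions."""
    stack = []
    for tok in postfix:
        if tok in PRECEDENCE:
            right = stack.pop()
            left = stack.pop()
            stack.append([tok] + left + right)
        else:
            stack.append([tok])
    return stack[0]


def eval_postfix(postfix):
    """Evaluate a postfix token list with a single operand stack."""
    stack = []
    for tok in postfix:
        if tok in OPS:
            b = stack.pop()
            a = stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(float(tok))
    return stack[0]


infix = "( 3 + 4 ) * 2".split()
postfix = infix_to_postfix(infix)    # ['3', '4', '+', '2', '*']
prefix = postfix_to_prefix(postfix)  # ['*', '+', '3', '4', '2']
answer = eval_postfix(postfix)       # 14.0
```

Note that the prefix and postfix forms need no parentheses to fix the order of operations, whereas the infix form does.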


Efficient Relaxed Gradient Support Pursuit for Sparsity Constrained Non-convex Optimization

arXiv.org Machine Learning

Large-scale non-convex sparsity-constrained problems have recently gained extensive attention. Most existing deterministic optimization methods (e.g., GraSP) are not suitable for large-scale and high-dimensional problems, and thus stochastic optimization methods with hard thresholding (e.g., SVRGHT) become more attractive. Inspired by GraSP, this paper proposes a new general relaxed gradient support pursuit (RGraSP) framework, in which the sub-algorithm is only required to satisfy a slack descent condition. We also design two specific semi-stochastic gradient hard thresholding algorithms. In particular, our algorithms have far fewer hard thresholding operations than SVRGHT, and their average per-iteration cost is much lower (i.e., O(d) vs. O(d log(d)) for SVRGHT), which leads to faster convergence. Our experimental results on both synthetic and real-world datasets show that our algorithms are superior to the state-of-the-art gradient hard thresholding methods.
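For readers unfamiliar with the hard thresholding operation mentioned above, here is a minimal sketch of the generic operator (keep the k largest-magnitude entries, zero the rest) and of one thresholded gradient step. It does not implement RGraSP or SVRGHT; the function names and the plain-list representation are assumptions for illustration.

```python
import heapq


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    if k <= 0:
        return [0.0] * len(x)
    # Indices of the k entries with the largest absolute value.
    keep = set(heapq.nlargest(k, range(len(x)), key=lambda i: abs(x[i])))
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def thresholded_gradient_step(x, grad, step, k):
    """One gradient step followed by projection onto k-sparse vectors."""
    moved = [xi - step * gi for xi, gi in zip(x, grad)]
    return hard_threshold(moved, k)


sparse = hard_threshold([0.5, -2.0, 0.1, 3.0, -0.2], k=2)
# sparse == [0.0, -2.0, 0.0, 3.0, 0.0]
```

Each call to `hard_threshold` has to select the top-k entries of a d-dimensional vector, which is why performing the operation less often, as the abstract describes, can lower the average per-iteration cost.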


CNNs, LSTMs, and Attention Networks for Pathology Detection in Medical Data

arXiv.org Machine Learning

For the weakly supervised task of electrocardiogram (ECG) rhythm classification, convolutional neural networks (CNNs) and long short-term memory (LSTM) networks are two increasingly popular classification models. This work investigates whether a combination of both architectures to so-called convolutional long short-term memory (ConvLSTM) networks can improve classification performances by explicitly capturing morphological as well as temporal features of raw ECG records. In addition, various attention mechanisms are studied to localize and visualize record sections of abnormal morphology and irregular rhythm. The resulting saliency maps are supposed to not only allow for a better network understanding but to also improve clinicians' acceptance of automatic diagnosis in order to avoid the technique being labeled as a black box. In further experiments, attention mechanisms are actively incorporated into the training process by learning a few additional attention gating parameters in a CNN model. An 8-fold cross validation is finally carried out on the PhysioNet Computing in Cardiology (CinC) challenge 2017 to compare the performances of standard CNN models, ConvLSTMs, and attention gated CNNs.


On the Delta Method for Uncertainty Approximation in Deep Learning

arXiv.org Machine Learning

The Delta method is a well-known procedure used to quantify uncertainty in statistical models. The method has previously been applied in the context of neural networks, but has not reached much popularity in deep learning because of the sheer size of the Hessian matrix. In this paper, we propose a low cost variant of the method based on an approximate eigendecomposition of the positive curvature subspace of the Hessian matrix. The method has a computational complexity of $O(KPN)$ time and $O(KP)$ space, where $K$ is the number of utilized Hessian eigenpairs, $P$ is the number of model parameters and $N$ is the number of training examples. Given that the model is $L_2$-regularized with rate $\lambda/2$, we provide a bound on the uncertainty approximation error given $K$. We show that when the smallest Hessian eigenvalue in the positive $K/2$-tail of the full spectrum, and the largest Hessian eigenvalue in the negative $K/2$-tail of the full spectrum are both approximately equal to $\lambda$, the error will be close to zero even when $K\ll P$. We demonstrate the method by a TensorFlow implementation, and show that meaningful rankings of images based on prediction uncertainty can be obtained for a convolutional neural network based MNIST classifier. We also observe that false positives have higher prediction uncertainty than true positives. This suggests that there is supplementary information in the uncertainty measure not captured by the probability alone.


A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning

arXiv.org Machine Learning

One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions. While this paradigm has inspired significant research on the properties of large networks, relatively little work has been devoted to the fact that these networks are often used to model large complex datasets, which may themselves contain millions or even billions of constraints. In this work, we focus on this high-dimensional regime in which both the dataset size and the number of features tend to infinity. We analyze the performance of a simple regression model trained on the random features $F=f(WX+B)$ for a random weight matrix $W$ and random bias vector $B$, obtaining an exact formula for the asymptotic training error on a noisy autoencoding task. The role of the bias can be understood as parameterizing a distribution over activation functions, and our analysis directly generalizes to such distributions, even those not expressible with a traditional additive bias. Intriguingly, we find that a mixture of nonlinearities can outperform the best single nonlinearity on the noisy autoencoding task, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.


Federated Learning with Personalization Layers

arXiv.org Machine Learning

The emerging paradigm of federated learning strives to enable collaborative training of machine learning models on the network edge without centrally aggregating raw data and hence, improving data privacy. This sharply deviates from traditional machine learning and necessitates the design of algorithms robust to various sources of heterogeneity. Specifically, statistical heterogeneity of data across user devices can severely degrade the performance of standard federated averaging for traditional machine learning applications like personalization with deep learning. This paper proposes FedPer, a base + personalization layer approach for federated training of deep feedforward neural networks, which can combat the ill-effects of statistical heterogeneity. We demonstrate effectiveness of FedPer for non-identical data partitions of CIFAR datasets and on a personalized image aesthetics dataset from Flickr.
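The base + personalization split can be pictured with a small sketch of one aggregation round, in which only the shared base weights are averaged across clients and each client's personalization weights stay local. The abstract does not give the aggregation details, so the sample-count weighting (borrowed from standard federated averaging), the dictionary layout, and the function names below are all assumptions for illustration.

```python
def average_base_layers(clients):
    """Sample-weighted average of the shared base weights (assumed scheme)."""
    total = sum(c["num_samples"] for c in clients)
    dim = len(clients[0]["base"])
    return [
        sum(c["num_samples"] * c["base"][j] for c in clients) / total
        for j in range(dim)
    ]


def federated_round(clients):
    """Aggregate base weights on the server and send them back to clients.

    The "personal" weights are never read or modified here, so they
    remain specific to each client.
    """
    new_base = average_base_layers(clients)
    for c in clients:
        c["base"] = list(new_base)
    return new_base


# Two hypothetical clients with flattened weight vectors.
clients = [
    {"num_samples": 10, "base": [1.0, 2.0], "personal": [0.1]},
    {"num_samples": 30, "base": [3.0, 6.0], "personal": [0.9]},
]
shared = federated_round(clients)  # [2.5, 5.0]
```

After the round, both clients hold the same base weights while their personalization weights still differ, which is the property the base + personalization design relies on.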


Differential Bayesian Neural Nets

arXiv.org Machine Learning

Neural Ordinary Differential Equations (N-ODEs) are a powerful building block for learning systems, which extend residual networks to a continuous-time dynamical system. We propose a Bayesian version of N-ODEs that enables well-calibrated quantification of prediction uncertainty, while maintaining the expressive power of their deterministic counterpart. We assign Bayesian Neural Nets (BNNs) to both the drift and the diffusion terms of a Stochastic Differential Equation (SDE) that models the flow of the activation map in time. We infer the posterior on the BNN weights using a straightforward adaptation of Stochastic Gradient Langevin Dynamics (SGLD). We illustrate significantly improved stability on two synthetic time series prediction tasks and report better model fit on UCI regression benchmarks with our method when compared to its non-Bayesian counterpart.
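As background for the SGLD-based inference mentioned above, the sketch below shows the basic Langevin update on a toy one-dimensional Gaussian posterior: a half-step along the gradient of the log posterior plus Gaussian noise whose variance equals the step size. It is not the paper's adaptation for SDE-based models. The target distribution, step size, and function names are assumptions, and the toy uses an exact gradient where SGLD proper would use a minibatch estimate.

```python
import math
import random


def grad_log_post(theta, mean=2.0, var=1.0):
    """Gradient of the log density of a toy Gaussian posterior N(mean, var)."""
    return -(theta - mean) / var


def sgld_step(theta, grad_fn, step_size, rng):
    """One Langevin update: half-step along the gradient plus N(0, step_size) noise."""
    noise = rng.gauss(0.0, math.sqrt(step_size))
    return theta + 0.5 * step_size * grad_fn(theta) + noise


def run_sgld(num_steps=50000, burn_in=5000, step_size=0.1, seed=0):
    """Collect samples after a burn-in period, starting from theta = 0."""
    rng = random.Random(seed)
    theta, samples = 0.0, []
    for t in range(num_steps):
        theta = sgld_step(theta, grad_log_post, step_size, rng)
        if t >= burn_in:
            samples.append(theta)
    return samples


samples = run_sgld()
posterior_mean = sum(samples) / len(samples)  # should land near the toy mean of 2.0
```

With a fixed step size the samples only approximate the posterior; in practice the step size is usually decayed over time.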


Is Discriminator a Good Feature Extractor?

arXiv.org Machine Learning

The discriminator from generative adversarial nets (GANs) has been used in some research as a feature extractor for transfer learning, and it has worked well. However, other studies argue that this is the wrong research direction because, intuitively, the discriminator's task is to separate real samples from generated ones, which would make the features extracted this way useless for most downstream tasks. In this work, we find that the connection between the discriminator's task and its features is not as strong as previously thought: the main factor restricting the features learned by the discriminator is not the task itself, but the need to prevent the entire GAN model from mode collapse during training. From this perspective, combined with further analyses, we find that to avoid mode collapse during GAN training, the features extracted by the discriminator are not guided to be different for the real samples, but divergence without noise is indeed allowed and occupies a large proportion of the feature space. This makes the learned features more robust and helps explain why the discriminator can succeed as a feature extractor in related research. We then analyze the counterpart of the discriminator extractor: the classifier extractor, which assigns the target samples to different categories. We find that the discriminator extractor may perform worse than the classifier-based extractor when the source classification task is similar to the target task, which is a common case. However, its ability to avoid noise prevents the discriminator from being replaced by the classifier. Finally, because our research also reveals a ratio that plays an important role in preventing mode collapse during GAN training, it may contribute to fundamental GAN research.


scikit-hubness: Hubness Reduction and Approximate Neighbor Search

arXiv.org Machine Learning

This paper introduces scikit-hubness, a Python package for efficient nearest neighbor search in high-dimensional spaces. Hubness is an aspect of the curse of dimensionality, and is known to impair various learning tasks, including classification, clustering, and visualization. scikit-hubness provides algorithms for hubness analysis ("Is my data affected by hubness?"), hubness reduction ("How can we improve neighbor retrieval in high dimensions?"), and approximate neighbor search ("Does it work for large data sets?"). It is integrated into the scikit-learn environment, enabling rapid adoption by Python-based machine learning researchers and practitioners. Users will find all functionality of the scikit-learn neighbors package, plus additional support for transparent hubness reduction and approximate nearest neighbor search. scikit-hubness is developed using several quality assessment tools and principles, such as PEP8 compliance, unit tests with high code coverage, continuous integration on all major platforms (Linux, MacOS, Windows), and additional checks by LGTM. The source code is available at https://github.com/VarIr/scikit-hubness under the BSD 3-clause license. Install from the Python package index with $ pip install scikit-hubness.
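To make the notion of hubness analysis concrete, here is a minimal sketch of one commonly used measure: the skewness of the k-occurrence distribution, where the k-occurrence of a point is the number of other points that list it among their k nearest neighbors. The sketch deliberately does not use the scikit-hubness API; it is a brute-force, pure-Python illustration, and the function names and toy data are assumptions.

```python
def k_occurrence(points, k):
    """Count how often each point appears in other points' k-nearest-neighbor lists."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Squared Euclidean distance from point i to every other point.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        ]
        for _, j in sorted(dists)[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment; large positive values suggest hub points."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


points = [[0.0], [1.0], [2.5], [10.0]]
counts = k_occurrence(points, k=1)  # [1, 2, 1, 0]
hubness = skewness(counts)          # 0.0 for this small symmetric example
```

A strongly right-skewed k-occurrence distribution means that a few points act as hubs that appear in many neighbor lists while many others appear in none.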


ReD-CaNe: A Systematic Methodology for Resilience Analysis and Design of Capsule Networks under Approximations

arXiv.org Machine Learning

Recent advances in Capsule Networks (CapsNets) have shown their superior learning capability, compared to the traditional Convolutional Neural Networks (CNNs). However, the extremely high complexity of CapsNets limits their fast deployment in real-world applications. Moreover, while the resilience of CNNs has been extensively investigated to enable their energy-efficient implementations, the analysis of CapsNets' resilience is a largely unexplored area that can provide a strong foundation to investigate techniques to overcome the CapsNets' complexity challenge. Following the trend of Approximate Computing to enable energy-efficient designs, we perform an extensive resilience analysis of the CapsNets inference subjected to approximation errors. Our methodology models the errors arising from the approximate components (like multipliers), and analyzes their impact on the classification accuracy of CapsNets. This enables the selection of approximate components based on the resilience of each operation of the CapsNet inference. We modify the TensorFlow framework to simulate the injection of approximation noise (based on the models of the approximate components) at different computational operations of the CapsNet inference. Our results show that the CapsNets are more resilient to the errors injected in the computations that occur during the dynamic routing (the softmax and the update of the coefficients), rather than other stages like convolutions and activation functions. Our analysis is extremely useful towards designing efficient CapsNet hardware accelerators with approximate components. To the best of our knowledge, this is the first proof-of-concept for employing approximations on the specialized CapsNet hardware.