
Collaborating Authors

Rangamani, Akshay


Low Rank and Sparse Fourier Structure in Recurrent Networks Trained on Modular Addition

arXiv.org Machine Learning

Akshay Rangamani, Dept. of Data Science, New Jersey Institute of Technology, Newark, NJ, USA. akshay.rangamani@njit.edu

Abstract--Modular addition tasks serve as a useful test bed for observing empirical phenomena in deep learning, including the phenomenon of grokking. Prior work has shown that one-layer transformer architectures learn Fourier Multiplication circuits to solve modular addition tasks. In this paper, we show that Recurrent Neural Networks (RNNs) trained on modular addition tasks also use a Fourier Multiplication strategy. We identify low rank structures in the model weights and attribute model components to specific Fourier frequencies, resulting in a sparse representation in Fourier space. We also show empirically that the RNN is robust to removing individual frequencies, while performance degrades drastically as more frequencies are ablated from the model.
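The Fourier Multiplication strategy the abstract refers to can be illustrated independently of any network: representing each residue as a phase on the unit circle turns addition mod p into multiplication of complex numbers. A minimal pure-Python sketch (the modulus and the phase-decoding scheme are illustrative choices, not taken from the paper):

```python
import cmath

def fourier_add(a, b, p):
    # Encode each residue at the frequency-1 Fourier component: e^{2*pi*i*a/p}.
    za = cmath.exp(2j * cmath.pi * a / p)
    zb = cmath.exp(2j * cmath.pi * b / p)
    # Multiplying the two phases adds the angles, i.e. adds a and b mod p.
    z = za * zb
    # Decode the sum by matching the resulting phase back to a residue class.
    angle = cmath.phase(z) % (2 * cmath.pi)
    return round(angle * p / (2 * cmath.pi)) % p

# The Fourier route agrees with ordinary modular addition for every pair.
assert all(fourier_add(a, b, 7) == (a + b) % 7
           for a in range(7) for b in range(7))
```

A trained model uses several such frequencies at once, which is what makes the ablation experiments in the paper possible: each frequency is an independent copy of this mechanism.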


On Generalization Bounds for Neural Networks with Low Rank Layers

arXiv.org Machine Learning

Deep learning has achieved remarkable success across a wide range of applications, including computer vision [2, 3], natural language processing [4, 5], decision-making in novel environments [6], and code generation [7], among others. Understanding the reasons behind the effectiveness of deep learning is a multifaceted challenge that involves questions about architectural choices, optimizer selection, and the types of inductive biases that can guarantee generalization. A long-standing question in this field is how deep learning finds solutions that generalize well. Good generalization by overparameterized models is not unique to deep learning: in linear models and kernel machines it can be explained by the implicit bias of learning algorithms towards low-norm solutions [8, 9]. In the case of deep learning, however, identifying the right implicit bias and obtaining generalization bounds that depend on this bias are still open questions. In recent years, Rademacher bounds have been developed to explain the complexity control induced by an important bias in deep network training: the minimization of weight matrix norms, which occurs due to explicit or implicit regularization [10, 11, 12, 13]. For rather general network architectures, Golowich et al. [14] showed that the Rademacher complexity is linear in the product of the Frobenius norms of the layers. Although the associated bounds are usually orders of magnitude larger than the generalization gap for dense networks, recent results by Galanti et al. [15] demonstrate that for networks with structural sparsity in their weight matrices, such as convolutional networks, norm-based Rademacher bounds approach non-vacuity.
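The quantity these norm-based bounds scale with is straightforward to compute for a concrete network. A short sketch (the layer shapes and random weights are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-layer network weights; dimensions are illustrative.
layers = [rng.standard_normal((64, 32)),
          rng.standard_normal((32, 32)),
          rng.standard_normal((32, 10))]

def frobenius_product(weights):
    """Product of per-layer Frobenius norms: the quantity that
    Golowich et al.-style Rademacher bounds scale with (up to constants)."""
    prod = 1.0
    for W in weights:
        prod *= np.linalg.norm(W, ord="fro")
    return prod

# The bound is multiplicative in the layer norms: scaling any single
# layer by c scales the whole complexity measure by c.
scaled = [2.0 * layers[0]] + layers[1:]
assert np.isclose(frobenius_product(scaled), 2.0 * frobenius_product(layers))
```

The multiplicative structure is exactly why such bounds are often vacuous for dense networks, and why low rank or sparse structure in the weight matrices (which shrinks each factor) can bring them toward non-vacuity.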


Neural-guided, Bidirectional Program Search for Abstraction and Reasoning

arXiv.org Artificial Intelligence

One of the challenges facing artificial intelligence research today is designing systems capable of using systematic reasoning to generalize to new tasks. The Abstraction and Reasoning Corpus (ARC) measures such a capability through a set of visual reasoning tasks. In this paper we report incremental progress on ARC and lay the foundations for two approaches to abstraction and reasoning that are not based on brute-force search. We first apply an existing program synthesis system called DreamCoder to create symbolic abstractions from tasks solved so far, and show how this enables the solving of progressively more challenging ARC tasks. Second, we design a reasoning algorithm motivated by the way humans approach ARC. Our algorithm constructs a search graph and reasons over this graph structure to discover task solutions. More specifically, we extend existing execution-guided program synthesis approaches with deductive reasoning based on function inverse semantics to enable a neural-guided bidirectional search algorithm. We demonstrate the effectiveness of the algorithm on three domains: ARC, 24-Game tasks, and a 'double-and-add' arithmetic puzzle.
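The role of function inverse semantics is easiest to see on the 'double-and-add' domain: instead of enumerating programs forward, one can deduce backward from the target using the inverses of the two operations. A minimal sketch (this assumes a start value of 1 and the operations x2 and +1; the paper's exact puzzle rules may differ):

```python
def double_and_add_plan(target):
    # Deduce a program backward from the target using function inverses:
    # the inverse of "double" is "halve" (only legal on even values),
    # the inverse of "add one" is "subtract one".
    ops = []
    n = target
    while n > 1:
        if n % 2 == 0:
            ops.append("double")
            n //= 2
        else:
            ops.append("add1")
            n -= 1
    return list(reversed(ops))  # forward program taking 1 to target

def run(ops):
    # Forward (execution) semantics, used here to check the deduction.
    n = 1
    for op in ops:
        n = n * 2 if op == "double" else n + 1
    return n

assert run(double_and_add_plan(41)) == 41
```

In the full algorithm the same idea is combined with forward execution-guided synthesis, so that a neural model can guide the search from both ends of the graph until the frontiers meet.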


For interpolating kernel machines, minimizing the norm of the ERM solution minimizes stability

arXiv.org Machine Learning

We study the average $\mbox{CV}_{loo}$ stability of kernel ridge-less regression and derive corresponding risk bounds. We show that the interpolating solution with minimum norm minimizes a bound on $\mbox{CV}_{loo}$ stability, which in turn is controlled by the condition number of the empirical kernel matrix. The latter can be characterized in the asymptotic regime where both the dimension and cardinality of the data go to infinity. Under the assumption of random kernel matrices, the corresponding test error should be expected to follow a double descent curve.
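The two objects in this abstract, the minimum-norm interpolating solution and the condition number of the empirical kernel matrix, can be sketched in a few lines (the Gaussian kernel, bandwidth, and data sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))   # 20 training points in 5 dimensions
y = rng.standard_normal(20)

# Empirical Gaussian (RBF) kernel matrix on the training points.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

# Ridge-less solution: among all interpolating coefficient vectors,
# the pseudoinverse returns the one of minimum (RKHS) norm.
alpha = np.linalg.pinv(K) @ y

# Interpolation: the fitted function passes through every training label.
assert np.allclose(K @ alpha, y)

# Stability (and hence the risk bound) is controlled by this quantity.
cond = np.linalg.cond(K)
```

As the abstract notes, it is the behavior of `cond` in the joint limit of dimension and sample size that governs the stability bound and produces the expected double descent curve.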


A Scale Invariant Flatness Measure for Deep Network Minima

arXiv.org Machine Learning

It has been empirically observed that the flatness of minima obtained from training deep networks seems to correlate with better generalization. However, for deep networks with positively homogeneous activations, most measures of sharpness/flatness are not invariant to rescalings of the network parameters that correspond to the same function. This means that the measure of flatness/sharpness can be made as small or as large as desired through rescaling, rendering the quantitative measures meaningless. In this paper we show that for deep networks with positively homogeneous activations, these rescalings constitute equivalence relations, and that these equivalence relations induce a quotient manifold structure in the parameter space. Using this manifold structure and an appropriate metric, we propose a Hessian-based measure for flatness that is invariant to rescaling. We use this new measure to confirm the proposition that Large-Batch SGD minima are indeed sharper than Small-Batch SGD minima.
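The rescaling problem is easy to demonstrate: for a positively homogeneous activation such as ReLU, scaling one layer by a factor and the next by its reciprocal leaves the network function unchanged while changing norm-based quantities that sharpness proxies are built on. A minimal sketch (the shapes and the norm proxy are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((4, 16))
x = rng.standard_normal(8)

def net(W1, W2, x):
    # Two-layer ReLU network; ReLU is positively homogeneous:
    # relu(c * z) == c * relu(z) for any c > 0.
    return W2 @ np.maximum(W1 @ x, 0.0)

alpha = 10.0
# The rescaled parameters represent exactly the same function...
assert np.allclose(net(W1, W2, x), net(alpha * W1, W2 / alpha, x))

# ...but a naive proxy such as the total squared weight norm changes,
# so any non-invariant sharpness measure can be pushed arbitrarily
# high or low without changing the function at all.
total_sq_norm = lambda A, B: np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2
assert total_sq_norm(alpha * W1, W2 / alpha) != total_sq_norm(W1, W2)
```

A meaningful flatness measure therefore has to be defined on equivalence classes of parameters under these rescalings, which is exactly the quotient manifold construction in the paper.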


Automated software vulnerability detection with machine learning

arXiv.org Machine Learning

Thousands of security vulnerabilities are discovered in production software each year, either reported publicly to the Common Vulnerabilities and Exposures database or discovered internally in proprietary code. Vulnerabilities often manifest themselves in subtle ways that are not obvious to code reviewers or the developers themselves. With the wealth of open source code available for analysis, there is an opportunity to learn the patterns of bugs that can lead to security vulnerabilities directly from data. In this paper, we present a data-driven approach to vulnerability detection using machine learning, specifically applied to C and C++ programs. We first compile a large dataset of hundreds of thousands of open-source functions labeled with the outputs of a static analyzer. We then compare methods applied directly to source code with methods applied to artifacts extracted from the build process, finding that source-based models perform better. We also compare the application of deep neural network models with more traditional models such as random forests and find the best performance comes from combining features learned by deep models with tree-based models. Ultimately, our highest performing model achieves an area under the precision-recall curve of 0.49 and an area under the ROC curve of 0.87.
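The headline metric, area under the ROC curve, has a simple rank interpretation: the probability that a randomly chosen vulnerable function receives a higher score than a randomly chosen benign one. A dependency-free sketch of that computation (the toy labels and scores here are made up for illustration):

```python
def roc_auc(labels, scores):
    # AUC as the Wilcoxon-Mann-Whitney statistic: fraction of
    # positive/negative pairs ranked correctly, counting ties as 1/2.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One of four positive/negative pairs is mis-ranked -> AUC of 0.75.
assert roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) == 0.75
```

On heavily imbalanced data such as vulnerability labels, the precision-recall curve (the paper's 0.49 figure) is the more informative companion, since ROC AUC can look strong even when precision at useful recall levels is low.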


Sparse Coding and Autoencoders

arXiv.org Machine Learning

In "Dictionary Learning" one tries to recover incoherent matrices $A^* \in \mathbb{R}^{n \times h}$ (typically overcomplete and whose columns are assumed to be normalized) and sparse vectors $x^* \in \mathbb{R}^h$ with a small support of size $h^p$ for some $0 < p < 1$.
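The generative model in this setup is easy to instantiate: an overcomplete dictionary with unit-norm columns and a code supported on roughly $h^p$ coordinates. A minimal sketch (the dimensions, the value of the sparsity exponent, and the Gaussian dictionary are illustrative assumptions; a Gaussian dictionary is only incoherent with high probability, not by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 32, 64          # overcomplete regime: h > n
p_exp = 0.5            # sparsity exponent, illustrative choice

# Dictionary with unit-norm columns (incoherent w.h.p. for random Gaussians).
A = rng.standard_normal((n, h))
A /= np.linalg.norm(A, axis=0)

# Sparse code with support of size h**p_exp.
k = int(h ** p_exp)
x = np.zeros(h)
support = rng.choice(h, size=k, replace=False)
x[support] = rng.standard_normal(k)

# Observation the autoencoder is trained to reconstruct.
y = A @ x
assert np.count_nonzero(x) == k and y.shape == (n,)
```

Samples `y` drawn this way are what an autoencoder would be trained on when studying whether its learned weights recover the columns of $A^*$.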