Distance and Equivalence between Finite State Machines and Recurrent Neural Networks: Computational results Machine Learning

The need to interpret Deep Learning (DL) models has led, in recent years, to a proliferation of works addressing this issue. Among the strategies that aim to shed light on how information is represented internally in DL models, one consists of extracting symbolic rule-based machines from connectionist models that are supposed to approximate their behaviour well. To better understand how reasonable these approximation strategies are, we need to know the computational complexity of measuring the quality of approximation. In this article, we prove some computational results related to the problem of extracting Finite State Machine (FSM) based models from trained RNN language models. More precisely, we show the following: (a) For general weighted RNN-LMs with a single hidden layer and a ReLU activation: - The equivalence problem of a PDFA/PFA/WFA and a weighted first-order RNN-LM is undecidable; - As a corollary, the distance problem between languages generated by a PDFA/PFA/WFA and that of a weighted RNN-LM is not recursive; - The intersection between a DFA and the cut language of a weighted RNN-LM is undecidable; - The equivalence of a PDFA/PFA/WFA and a weighted RNN-LM in a finite support is EXP-Hard; (b) For consistent weight RNN-LMs with any computable activation function: - The Chebyshev distance approximation is decidable; - The Chebyshev distance approximation in a finite support is NP-Hard. Moreover, our reduction technique from 3-SAT makes this latter fact easily generalizable to other RNN architectures (e.g. LSTMs/RNNs), and to RNNs with finite precision.
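To make the notion of a distance "in a finite support" concrete, here is a minimal Python sketch. It assumes the Chebyshev distance means the largest absolute difference between the probabilities two models assign to the same string, and it restricts the comparison to all strings up to a fixed length. The dictionary layout for a PDFA, the helper names `pdfa_prob` and `chebyshev_distance`, and the two toy automata are hypothetical illustrations, not taken from the paper. Any callable that returns a string probability (for instance a wrapper around an RNN-LM) could stand in for either model.

```python
from itertools import product


def pdfa_prob(pdfa, word):
    """Probability that the PDFA emits exactly `word` and then stops."""
    state, prob = pdfa["start"], 1.0
    for symbol in word:
        if symbol not in pdfa["trans"][state]:
            return 0.0  # no transition on this symbol: the word has probability 0
        p, state = pdfa["trans"][state][symbol]
        prob *= p
    return prob * pdfa["stop"][state]


def chebyshev_distance(model_a, model_b, alphabet, max_len):
    """Largest |P_a(w) - P_b(w)| over all words w with len(w) <= max_len."""
    best = 0.0
    for n in range(max_len + 1):
        for word in product(alphabet, repeat=n):
            best = max(best, abs(model_a(word) - model_b(word)))
    return best


# Two hypothetical one-state PDFAs over the alphabet {a, b}; neither emits "b".
A = {"start": 0, "trans": {0: {"a": (0.5, 0)}}, "stop": {0: 0.5}}
B = {"start": 0, "trans": {0: {"a": (0.25, 0)}}, "stop": {0: 0.75}}

dist = chebyshev_distance(
    lambda w: pdfa_prob(A, w),
    lambda w: pdfa_prob(B, w),
    alphabet=["a", "b"],
    max_len=4,
)
print(dist)
```

The enumeration visits every string up to `max_len`, so its cost grows exponentially with that bound; it is only meant to illustrate the quantity being measured, not to suggest an efficient procedure.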

Neural Network Gradients: Backpropagation, Dual Numbers, Finite Differences


In the post How to Train Neural Networks With Backpropagation I said that you could also calculate the gradient of a neural network by using dual numbers or finite differences. That post explains backpropagation. Since the fundamentals are explained in the linked posts, we'll go straight to the code. We'll be getting the gradient (learning values) for the network in example 4 of the backpropagation post. Note that I am using "central differences" for the gradient; it would be more efficient to use a forward or backward difference, at the cost of some accuracy. I didn't compare the running times of each method, as my code is meant to be readable, not fast, and the code isn't doing enough work to make a meaningful performance test, IMO.
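As a rough illustration of the two alternatives to backpropagation, here is a minimal Python sketch. It is not the post's code and does not use the network from example 4: the one-neuron squared-error `loss`, the `Dual` class, and the helpers `grad_central` and `grad_dual` are hypothetical stand-ins chosen to keep the example short.

```python
class Dual:
    """Dual number real + dual * eps, with eps * eps = 0."""

    def __init__(self, real, dual=0.0):
        self.real = real
        self.dual = dual

    @staticmethod
    def _lift(value):
        return value if isinstance(value, Dual) else Dual(value)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.real + other.real, self.dual + other.dual)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual._lift(other)
        return Dual(self.real - other.real, self.dual - other.dual)

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

    __rmul__ = __mul__


def loss(w, b, x=2.0, target=1.0):
    """Squared error of a single linear neuron (hypothetical example)."""
    err = w * x + b - target
    return err * err


def grad_central(f, params, eps=1e-5):
    """Central differences: (f(p + eps) - f(p - eps)) / (2 * eps) per parameter."""
    grads = []
    for i in range(len(params)):
        hi, lo = list(params), list(params)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(*hi) - f(*lo)) / (2 * eps))
    return grads


def grad_dual(f, params):
    """Dual numbers: seed one parameter's dual part with 1, read the derivative off."""
    grads = []
    for i in range(len(params)):
        args = [Dual(p, 1.0 if j == i else 0.0) for j, p in enumerate(params)]
        grads.append(f(*args).dual)
    return grads


params = [0.5, 0.25]
print(grad_central(loss, params))  # approximate
print(grad_dual(loss, params))     # exact up to floating-point rounding
```

Both helpers evaluate the function once or twice per parameter, so their cost scales with the number of weights, whereas backpropagation gets every partial derivative from one forward and one backward pass.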

My Brain is Full: When More Memory Helps Artificial Intelligence

We consider the problem of finding good finite-horizon policies for POMDPs under the expected reward metric. The policies considered are free finite-memory policies with limited memory; a policy is a mapping from the space of observation-memory pairs to the space of action-memory pairs (the policy updates the memory as it goes), and the number of possible memory states is a parameter of the input to the policy-finding algorithms. The algorithms considered here are preliminary implementations of three search heuristics: local search, simulated annealing, and genetic algorithms. We compare their outcomes to each other and to the optimal policies for each instance. We compare run times of each policy and of a dynamic programming algorithm for POMDPs developed by Hansen that iteratively improves a finite-state controller, the previous state of the art for finite-memory policies. The value of the best policy can only improve as the amount of memory increases, up to the amount needed for an optimal finite-memory policy. Our most surprising finding is that more memory helps in another way: given more memory than is needed for an optimal policy, the algorithms are more likely to converge to optimal-valued policies.
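The policy representation described above can be sketched as a lookup table. The following minimal Python example is an illustration under stated assumptions: the observation and action names, the two-state memory, and the helpers `run_policy` and `random_neighbor` are hypothetical, and the neighbor move is only one plausible way a local search might perturb such a table, not the paper's implementation.

```python
import random


def run_policy(policy, observations, memory=0):
    """Apply a finite-memory policy to a sequence of observations.

    `policy` maps (observation, memory) -> (action, new_memory).
    """
    actions = []
    for obs in observations:
        action, memory = policy[(obs, memory)]
        actions.append(action)
    return actions


def random_neighbor(policy, actions, n_memory, rng):
    """Copy the policy and re-draw the entry for one (observation, memory) pair."""
    neighbor = dict(policy)
    key = rng.choice(sorted(neighbor))
    neighbor[key] = (rng.choice(actions), rng.randrange(n_memory))
    return neighbor


# Hypothetical policy with two memory states: act only on the second "ping" in a row.
policy = {
    ("ping", 0): ("wait", 1),
    ("ping", 1): ("go", 0),
    ("quiet", 0): ("wait", 0),
    ("quiet", 1): ("wait", 0),
}

trace = run_policy(policy, ["ping", "ping", "quiet", "ping"])
print(trace)

neighbor = random_neighbor(policy, ["wait", "go"], n_memory=2, rng=random.Random(0))
```

The table has one entry per observation-memory pair, so raising the number of memory states enlarges the search space that heuristics such as local search have to explore.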

Finite sample expressive power of small-width ReLU networks Machine Learning

We study universal finite sample expressivity of neural networks, defined as the capability to perfectly memorize arbitrary datasets. For scalar outputs, existing results require a hidden layer as wide as $N$ to memorize $N$ data points. In contrast, we prove that a 3-layer (2-hidden-layer) ReLU network with $4\sqrt{N}$ hidden nodes can perfectly fit any arbitrary dataset. For $K$-class classification, we prove that a 4-layer ReLU network with $4\sqrt{N} + 4K$ hidden neurons can memorize arbitrary datasets. For example, a 4-layer ReLU network with only 8,000 hidden nodes can memorize datasets with $N$ = 1M and $K$ = 1k (e.g., ImageNet). Our results show that even small networks already have tremendous overfitting capability, admitting zero empirical risk for any dataset. We also extend our results to deeper and narrower networks, and prove converse results showing the necessity of $\Omega(N)$ parameters for shallow networks.
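The 8,000-node figure follows directly from the stated bound $4\sqrt{N} + 4K$. A short Python check of that arithmetic (the function names are hypothetical; only the two formulas come from the abstract):

```python
import math


def hidden_nodes_3layer(n_points):
    """Hidden-node count 4 * sqrt(N) quoted for the 3-layer scalar-output case."""
    return 4 * math.sqrt(n_points)


def hidden_nodes_4layer(n_points, n_classes):
    """Hidden-node count 4 * sqrt(N) + 4 * K quoted for K-class classification."""
    return 4 * math.sqrt(n_points) + 4 * n_classes


# N = 1M data points, K = 1k classes, as in the abstract's example.
print(hidden_nodes_4layer(1_000_000, 1_000))  # 8000.0
print(hidden_nodes_3layer(1_000_000))         # 4000.0
```

Half of the 8,000 nodes come from the $4\sqrt{N}$ term and half from the $4K$ term in this example.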