

Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning

Neural Information Processing Systems

We study the asynchronous stochastic gradient descent algorithm for distributed training over n workers whose computation and communication speeds vary over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return them to the server without any synchronization.
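A minimal sketch of this update pattern, assuming a single parameter server, Python threads as workers, and a toy least-squares objective (the class and function names below are illustrative, not from the paper):

import threading
import numpy as np

class ParameterServer:
    """Holds the model; applies incoming gradients immediately, with no barrier."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def read(self):
        with self.lock:
            return self.w.copy()

    def apply_gradient(self, g):
        # The gradient may have been computed on stale parameters (a delayed read);
        # asynchronous SGD applies it anyway.
        with self.lock:
            self.w -= self.lr * g

def worker(server, seed, steps, target):
    # Each worker runs at its own pace: read parameters, compute a stochastic
    # gradient, and push it back without waiting for the other workers.
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        w = server.read()
        g = (w - target) + 0.01 * rng.normal(size=w.shape)  # noisy gradient of 0.5*||w - target||^2
        server.apply_gradient(g)

target = np.ones(10)
server = ParameterServer(dim=10)
threads = [threading.Thread(target=worker, args=(server, s, 200, target)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.linalg.norm(server.read() - target))  # should be close to zero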


Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Neural Information Processing Systems

Recent works (e.g., Li and Arora, 2020) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., the use of exponentially increasing learning rates. The current paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints, and then initiates a formal framework for studying their mathematics via a suitable adaptation of the conventional framework, namely modeling the SGD-induced training trajectory via a stochastic differential equation (SDE) with a noise term that captures gradient noise. This yields: (a) a new "intrinsic learning rate" parameter that is the product of the normal learning rate η and the weight decay factor λ. Analysis of the SDE shows how the effective speed of learning varies and equilibrates over time under the control of the intrinsic LR.
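As a rough sketch of the modeling step described above (the precise scaling and the noise covariance are spelled out in the paper; the notation here is only indicative), SGD with learning rate η and weight decay λ on a loss L is approximated by an SDE of the form

$$ dW_t \;=\; -\eta\,\big(\nabla L(W_t) + \lambda W_t\big)\,dt \;+\; \eta\,\Sigma(W_t)^{1/2}\,dB_t, \qquad \lambda_e := \eta\lambda, $$

where Σ is the gradient-noise covariance and the intrinsic learning rate λ_e governs the timescale on which the effective speed of learning equilibrates.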


Feature-fortified Unrestricted Graph Alignment

Neural Information Processing Systems

The necessity to align two graphs, minimizing a structural distance metric, is prevalent in biology, chemistry, recommender systems, and social network analysis. Due to the problem's NP-hardness, prevailing graph alignment methods follow a modular and mediated approach, solving the problem restricted to the domain of intermediary graph representations or products such as embeddings, spectra, and graph signals. Restricting the problem to this intermediate space may distort the original problem, and such methods are hence predisposed to miss high-quality solutions.
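For concreteness, one common instantiation of the unrestricted problem (not necessarily the exact metric used in the paper) seeks a node correspondence, e.g. a permutation matrix P, minimizing the structural discrepancy between the adjacency matrices A and B of the two graphs:

$$ \min_{P \in \Pi_n} \; \big\| A - P B P^{\top} \big\|_F^2, $$

whereas modular methods instead optimize a surrogate of this objective over embeddings, spectra, or graph signals.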


Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness
Long Zhao, Ting Liu, Xi Peng

Neural Information Processing Systems

Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. In this paper, we propose a novel and effective regularization term for adversarial data augmentation. We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation. Intuitively, this regularization term encourages perturbing the underlying source distribution to enlarge predictive uncertainty of the current model, so that the generated "hard" adversarial perturbations can improve the model robustness during training. Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin.
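A hedged sketch of this scheme in PyTorch-style code (the inner-loop schedule, step size, and entropy weight gamma below are illustrative placeholders, not the paper's settings): fictitious "hard" samples are generated by ascending both the task loss and the predictive entropy of the current model.

import torch
import torch.nn.functional as F

def entropy(logits):
    # Predictive entropy H(p) = -sum_c p_c log p_c of the model's softmax output.
    p = F.softmax(logits, dim=1)
    return -(p * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def maxent_adversarial_examples(model, x, y, steps=5, step_size=0.1, gamma=1.0):
    """Generate 'hard' fictitious samples by maximizing task loss + an entropy regularizer."""
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        logits = model(x_adv)
        # Maximize classification loss AND predictive uncertainty (maximum-entropy term).
        obj = F.cross_entropy(logits, y) + gamma * entropy(logits)
        grad, = torch.autograd.grad(obj, x_adv)
        x_adv = (x_adv + step_size * grad.sign()).detach().requires_grad_(True)
    return x_adv.detach()

The perturbed batch returned by maxent_adversarial_examples would then be fed back into the standard training loss, which is what makes the augmentation "adversarial" with respect to the current model.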


A novel constraint optimization method to encode the generic knowledge into a BN without requiring any training data

Neural Information Processing Systems

Our proposed approach can be applied to other AUs as well. In Tab. 6, LP-SM also considers apex frames on CK+, and the comparison to LP-SM is consistent. In Tab. 8, we apply FMPN-FER and DeepEmotion to our pre-processed ... We will consider a pre-trained VGGFace model in our future work. R2 2.1 The novelty compared to prior work. Facial expression can be a group of AUs.


A Appendix

Neural Information Processing Systems

A.1 Speech Translation Evaluation
One hyperparameter in our speech translation evaluation is the threshold on the alignment scores. Mined speech-text pairs are included in the train set if their alignment scores are greater than or equal to the threshold. Speech translation models are trained on the combination of the CoVoST2 train set and mined data at different thresholds. We report the performance of each model on the dev set of Common Voice in Figure 5, and find the optimal value for the threshold.
Figure 5: BLEU on dev set achieved by S2T models trained on CoVoST train set + mined data at different thresholds.
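A minimal sketch of the thresholding step described above (the helper and its field names are hypothetical, for illustration only):

def filter_mined_pairs(mined_pairs, threshold):
    """mined_pairs: iterable of (speech, text, alignment_score) triples from mining.
    Keep a pair only if its alignment score reaches the threshold."""
    return [(speech, text) for speech, text, score in mined_pairs if score >= threshold]

# Larger thresholds keep fewer but better-aligned pairs; the train set is the
# CoVoST2 train set plus the filtered pairs, and the threshold is chosen by
# BLEU on the Common Voice dev set.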


Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Neural Information Processing Systems

We present an approach to encode a speech signal into a fixed-size representation which minimizes the cosine loss with the existing massively multilingual LASER text embedding space. Sentences are close in this embedding space, independently of their language and modality, either text or audio. Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yielded more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several language pairs.
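A rough sketch of the mining step with numpy (at the scale of billions of sentences the real pipeline would rely on approximate nearest-neighbor search rather than a dense similarity matrix; the function below is illustrative only):

import numpy as np

def cosine_mine(audio_emb, text_emb, threshold=0.8):
    """audio_emb: (n_a, d) speech embeddings; text_emb: (n_t, d) LASER text embeddings.
    Returns (audio_idx, text_idx, score) for each audio segment whose best text
    match clears the threshold."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = a @ t.T                      # pairwise cosine similarities
    best = sims.argmax(axis=1)          # best text candidate per audio segment
    scores = sims[np.arange(len(a)), best]
    keep = scores >= threshold
    return [(i, int(best[i]), float(scores[i])) for i in np.where(keep)[0]]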


Calibration of Shared Equilibria in General Sum Partially Observable Markov Games - Supplementary
Nelson Vadori, Sumitra Ganesh, Prashant Reddy, Manuela Veloso (J.P. Morgan AI Research)

Neural Information Processing Systems

B.4 Complete set of experimental results associated with Section 4
In this section we display the complete set of results associated with the figures shown in Section 4. We display in Figure 2 the rewards of all agents during training (the calibrator, the merchant on supertype 1, and the n − 1 merchants on supertype 2) for experiments 1-5 previously described.


Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Neural Information Processing Systems

We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable by humans. This indicates that ViTs heavily use features that survive such transformations but are generally not indicative of the semantic class to humans. Further investigations show that these features are useful but non-robust, as ViTs trained on them can achieve high in-distribution accuracy, but break down under distribution shifts. From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use the images transformed with our patch-based operations as negatively augmented views and offer losses to regularize the training away from using non-robust features. This is a complementary view to existing research that mostly focuses on augmenting inputs with semantic-preserving transformations to enforce models' invariance. We show that patch-based negative augmentation consistently improves robustness of ViTs on ImageNet-based robustness benchmarks across 20+ different experimental settings. Furthermore, we find our patch-based negative augmentation is complementary to traditional (positive) data augmentation techniques and batch-based negative examples in contrastive learning.
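A hedged sketch of one such patch-based operation (random patch shuffling in PyTorch; the patch size and the way the resulting views enter the regularization loss are illustrative choices, not necessarily the paper's):

import torch

def shuffle_patches(images, patch_size=16):
    """Randomly permute the non-overlapping patches of each image in a batch.

    images: (B, C, H, W) tensor with H and W divisible by patch_size.
    The result keeps patch-level content but destroys the global semantics,
    so it can serve as a negatively augmented view."""
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    # Split into patches: (B, ph*pw, C, patch_size, patch_size).
    patches = images.reshape(b, c, ph, patch_size, pw, patch_size)
    patches = patches.permute(0, 2, 4, 1, 3, 5).reshape(b, ph * pw, c, patch_size, patch_size)
    # Independent random permutation of patch positions per image.
    perm = torch.stack([torch.randperm(ph * pw) for _ in range(b)])
    patches = torch.stack([patches[i, perm[i]] for i in range(b)])
    # Reassemble the shuffled patches into full images.
    patches = patches.reshape(b, ph, pw, c, patch_size, patch_size).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(b, c, h, w)

A training loss could then, for example, penalize the model for assigning the original label to shuffle_patches(x), discouraging reliance on features that survive the transformation.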


Supplementary material for the paper: Constant-Expansion Suffices for Compressed Sensing with Generative Priors

Neural Information Processing Systems

In this section we prove Theorem 3.2. The two arguments are essentially identical, and we will focus on the former. See [20] for a reference on the first bound. The second bound is by concentration of chi-squared with k degrees of freedom. We check that f and g satisfy the three conditions of Theorem 4.4 with appropriate parameters. Finally, since $\Pr[W \in \Theta] \ge 1/2$, it follows that conditioning on Θ at most doubles the failure probability.
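The last step uses only the following elementary bound (writing A for any failure event; a one-line sketch included here for completeness):

$$ \Pr[A \mid W \in \Theta] \;=\; \frac{\Pr[A,\, W \in \Theta]}{\Pr[W \in \Theta]} \;\le\; \frac{\Pr[A]}{\Pr[W \in \Theta]} \;\le\; 2\,\Pr[A]. $$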