Re Reviewer 1

Neural Information Processing Systems

Thank you for the positive comments. About ResNets' use of convolutional layers (Conv) and batch norm (BN) in practice: we acknowledge that Conv and BN are used in practice. About generalizability: considering kernel ridge regression, one can show that the generalization error is "continuous"; the same applies to the poor generalization of deep FFNets. About our motivation: analyzing increasing depth is common practice in theoretical research, because the infinite-depth behavior of NNs closely resembles their large-depth behavior. Moreover, we highlight that existing results are not limited to 50 and 100 layers.
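The "continuity" claim about kernel ridge regression can be illustrated numerically: a small perturbation of the kernel perturbs the predictor only slightly. A minimal NumPy sketch with an RBF kernel on a toy regression problem of our own choosing (not taken from the rebuttal):

```python
import numpy as np

def krr_predict(K_train, K_test, y, lam=0.1):
    """Kernel ridge regression: alpha = (K + lam*I)^(-1) y, then f = K_test @ alpha."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + lam * np.eye(n), y)
    return K_test @ alpha

def rbf(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0])
Xt = rng.uniform(-1, 1, size=(10, 1))

# Slightly perturbing the kernel slightly perturbs the predictions.
p1 = krr_predict(rbf(X, X, 1.0), rbf(Xt, X, 1.0), y)
p2 = krr_predict(rbf(X, X, 1.001), rbf(Xt, X, 1.001), y)
print(np.abs(p1 - p2).max())   # small: predictions vary continuously with the kernel
```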


Review for NeurIPS paper: Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel

Neural Information Processing Systems

Additional Feedback: Minor issues. *Visualization method of Figure 1: I am not sure how the authors produced this figure. Is it based on PCA of the trajectories? It is also unclear why straight lines give these trajectories; it is just a linear regression with the Taylorized model (2). More technically, when we use a data-dependent NTK in a linearized model, the positive definiteness of this NTK is non-trivial, and the equivalence to kernel regression becomes unclear.
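The last point can be made concrete. For a linearized model the empirical NTK is K = J J^T, where J is the Jacobian of the network output with respect to the parameters; K is positive semi-definite by construction, but the strict positive definiteness needed for the kernel-regression equivalence requires J to have full row rank. A sketch with a toy two-layer network of our own choosing (not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer scalar network f(x) = v @ tanh(W @ x); parameters are (W, v).
W = rng.normal(size=(5, 3))
v = rng.normal(size=5)

def param_grad(x):
    """Gradient of f w.r.t. all parameters, flattened: one row of the Jacobian J."""
    h = np.tanh(W @ x)
    dW = np.outer(v * (1 - h ** 2), x)        # df/dW via the chain rule
    return np.concatenate([dW.ravel(), h])    # df/dv = h

X = rng.normal(size=(10, 3))                  # 10 inputs
J = np.stack([param_grad(x) for x in X])      # Jacobian, shape (10, 20)
K = J @ J.T                                   # empirical NTK at this parameter point
eigs = np.linalg.eigvalsh(K)
print(eigs.min())   # >= 0 up to round-off: PSD, but strict PD is not guaranteed
```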


Reviews: Scalable methods for 8-bit training of neural networks

Neural Information Processing Systems

This is interesting given that most existing works are based on 16-bit arithmetic and people have had difficulties training 8-bit models. The paper identified that the training difficulty comes from batch norm, and it proposed a variant of batch norm, called range batch norm, which alleviates the numerical instability of the original batch norm occurring with quantized models. With this simple modification, the paper shows that an 8-bit model can be easily trained using GEMMLOWP, an existing framework. The paper also tried to analyze and understand the proposed approach in a theoretical manner. The experiments support the paper's argument well.
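As we read it, range batch norm replaces the batch-variance computation of standard batch norm with a range-based scale, avoiding the sums of squares that are numerically fragile at 8-bit precision. A hedged NumPy sketch, not the paper's exact formulation: the constant 1/sqrt(2*ln n) is motivated by the expected range of Gaussian data, and `gamma`/`beta` are the usual learnable parameters.

```python
import numpy as np

def range_batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize with a range-based scale instead of the batch variance.

    For roughly Gaussian activations, range(x) / sqrt(2*ln(n)) tracks the
    spread of the batch while avoiding squared sums that overflow at 8 bits.
    """
    n = x.shape[0]
    c = x - x.mean(axis=0)                    # center each feature
    scale = (c.max(axis=0) - c.min(axis=0)) / np.sqrt(2.0 * np.log(n))
    return gamma * c / (scale + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))                 # batch of 256, 8 features
y = range_batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))   # True: output is centered
```

Any constant factor the range estimate introduces relative to the true standard deviation is absorbed by the learnable `gamma`.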


Medical Image Segmentation with InTEnt: Integrated Entropy Weighting for Single Image Test-Time Adaptation

Dong, Haoyu, Konz, Nicholas, Gu, Hanxue, Mazurowski, Maciej A.

arXiv.org Artificial Intelligence

Test-time adaptation (TTA) refers to adapting a trained model to a new domain during testing. Existing TTA techniques rely on having multiple test images from the same domain, yet this may be impractical in real-world applications such as medical imaging, where data acquisition is expensive and imaging conditions vary frequently. Here, we approach the task of adapting a medical image segmentation model with only a single unlabeled test image. Most TTA approaches, which directly minimize the entropy of predictions, fail to improve performance significantly in this setting, in which we also observe that the choice of batch normalization (BN) layer statistics is a highly important yet unstable factor, since only a single test-domain example is available. To overcome this, we propose to instead integrate over predictions made with various estimates of target domain statistics between the training and test statistics, weighted based on their entropy statistics. Our method, validated on 24 source/target domain splits across 3 medical image datasets, surpasses the leading method by 2.9% Dice coefficient on average.
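The integration step can be sketched generically: given several predictions, each made under a different interpolation between source and target BN statistics, fuse them with entropy-based weights. The exp(-entropy) weights and the toy shapes below are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution along the last axis."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def integrate_predictions(preds):
    """Weight each candidate prediction by exp(-mean entropy), then average.

    `preds` has shape (k, H, W, C): k segmentation predictions, each produced
    with a different estimate of the BN statistics.
    """
    ents = np.array([entropy(p).mean() for p in preds])  # one scalar per candidate
    w = np.exp(-ents)
    w = w / w.sum()                                      # normalized weights
    return np.tensordot(w, preds, axes=1)                # weighted average map

# Two toy "predictions" over a 4x4 image: one confident, one near-uniform.
confident = np.tile([0.95, 0.05], (4, 4, 1))
uncertain = np.tile([0.55, 0.45], (4, 4, 1))
out = integrate_predictions(np.stack([confident, uncertain]))
print(out.shape)   # (4, 4, 2); the low-entropy candidate dominates the average
```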


Strategies to exploit XAI to improve classification systems

Apicella, Andrea, Di Lorenzo, Luca, Isgrò, Francesco, Pollastro, Andrea, Prevete, Roberto

arXiv.org Artificial Intelligence

Explainable Artificial Intelligence (XAI) aims to provide insights into the decision-making process of AI models, allowing users to understand their results beyond their decisions. A significant goal of XAI is to improve the performance of AI models by providing explanations for their decision-making processes. However, most XAI literature focuses on how to explain an AI system, while less attention has been given to how XAI methods can be exploited to improve an AI system. In this work, a set of well-known XAI methods typically used with Machine Learning (ML) classification tasks are investigated to verify whether they can be exploited not just to provide explanations but also to improve the performance of the model itself. To this end, two strategies that use explanations to improve a classification system are reported and empirically evaluated on three datasets: Fashion-MNIST, CIFAR10, and STL10. Results suggest that explanations built by Integrated Gradients highlight input features that can be effectively used to improve classification performance.
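For readers unfamiliar with the attribution method the abstract credits, Integrated Gradients scales the input-minus-baseline difference by the gradient of the model output averaged along the straight path from baseline to input. A sketch with a midpoint Riemann sum; the toy linear scorer is our own choice, for which the attribution is exactly w * (x - baseline):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=64):
    """Riemann approximation of Integrated Gradients:
    (x - baseline) * integral_0^1 grad f(baseline + a*(x - baseline)) da
    """
    alphas = (np.arange(steps) + 0.5) / steps       # midpoint rule
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model: a linear scorer f(x) = w @ x, whose gradient is constant.
w = np.array([0.5, -2.0, 1.0])
grad_f = lambda x: w
x = np.array([1.0, 1.0, 1.0])
attr = integrated_gradients(grad_f, x, baseline=np.zeros(3))
print(attr)   # [ 0.5 -2.   1. ]
```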


Assessment of few-hits machine learning classification algorithms for low energy physics in liquid argon detectors

Biassoni, Matteo, Giachero, Andrea, Grossi, Michele, Guffanti, Daniele, Labranca, Danilo, Moretti, Roberto, Rossi, Marco, Terranova, Francesco, Vallecorsa, Sofia

arXiv.org Artificial Intelligence

The physics potential of massive liquid argon TPCs in the low-energy regime is still to be fully reaped, because few-hits events encode information that can hardly be exploited by conventional classification algorithms. Machine learning (ML) techniques are at their best in this type of classification problem. In this paper, we evaluate their performance against conventional (deterministic) algorithms. We demonstrate that both Convolutional Neural Networks (CNN) and Transformer-Encoder methods outperform deterministic algorithms in one of the most challenging classification problems of low-energy physics (single- versus double-beta events). We discuss the advantages and pitfalls of Transformer-Encoder methods versus CNN and employ these methods to optimize the detector parameters, with an emphasis on the DUNE Phase II detectors ("Module of Opportunity").


Batch Norm Explained Visually -- How it works, and why neural networks need it

#artificialintelligence

Batch Norm is an essential part of the toolkit of the modern deep learning practitioner. Soon after it was introduced in the Batch Normalization paper, it was recognized as being transformational in creating deeper neural networks that could be trained faster. Batch Norm is a neural network layer that is now commonly used in many architectures. It often gets added as part of a Linear or Convolutional block and helps to stabilize the network during training. In this article, we will explore what Batch Norm is, why we need it, and how it works.
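The mechanics the article covers (normalize each feature over the batch, then scale and shift with learnable parameters) can be sketched in a few lines of NumPy; `gamma` and `beta` below stand in for the learnable scale and shift:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)                        # per-feature batch mean
    var = x.var(axis=0)                          # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)      # zero-mean, unit-variance features
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # batch of 64, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # True
print(np.allclose(y.std(axis=0), 1.0, atol=1e-2))   # True
```

During training the statistics come from the current batch, which is exactly what makes deeper networks easier to optimize; at inference, running averages of these statistics are used instead.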


A Deep Neural Network Based Reverse Radio Spectrogram Search Algorithm

Ma, Peter Xiangyuan, Croft, Steve, Siemion, Andrew P. V.

arXiv.org Artificial Intelligence

We developed a fast and modular deep learning algorithm to search for lookalike signals of interest in radio spectrogram data. First, we trained an autoencoder on filtered data returned by an energy detection algorithm. We then adapted a positional embedding layer from the classical Transformer architecture to a frequency-based embedding. Next, we used the encoder component of the autoencoder to extract features from small (715 Hz, with a resolution of 2.79 Hz per frequency bin) windows in the radio spectrogram. We used our algorithm to conduct a search for a given query (encoded signal of interest) on a set of signals (encoded features of searched items) to produce the top candidates with similar features. We successfully demonstrate that the algorithm retrieves signals with similar appearance, given only the original radio spectrogram data.
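The retrieval step (compare one encoded query against the encoded features of the searched items and return the closest matches) reduces to a nearest-neighbour search over feature vectors. A minimal cosine-similarity sketch, with random vectors standing in for the autoencoder features; the similarity measure is our assumption, as the abstract does not specify one:

```python
import numpy as np

def top_k_similar(query_vec, item_vecs, k=3):
    """Rank a bank of encoded items by cosine similarity to an encoded query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    sims = m @ q                          # cosine similarity to every item
    order = np.argsort(-sims)[:k]         # indices of the k best matches
    return order, sims[order]

rng = np.random.default_rng(2)
bank = rng.normal(size=(100, 32))         # 100 encoded windows, 32-dim features
query = bank[42] + 0.01 * rng.normal(size=32)   # near-duplicate of item 42
idx, scores = top_k_similar(query, bank)
print(idx[0])    # 42: the near-duplicate is retrieved first
```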