 dropout


Dirichlet-Based Monte Carlo Dropout for Uncertainty Estimation in Neural Networks

arXiv.org Machine Learning

Traditional neural networks provide deterministic predictions without inherent uncertainty estimates. While Bayesian Neural Networks (BNNs) offer a principled approach to uncertainty quantification, their computational complexity limits scalability. Monte Carlo (MC) Dropout, initially introduced as a regularization technique, has been shown to approximate Bayesian inference by enabling probabilistic modeling through multiple stochastic forward passes. In this work, we enhance uncertainty estimation in deep learning by integrating a Dirichlet-based framework within MC Dropout. Specifically, we leverage the formulation proposed by Sensoy et al. (2018), where class probabilities are modeled using a Dirichlet distribution, allowing for a more informative uncertainty representation. The proposed approach maintains the computational efficiency of MC Dropout while improving the quality of uncertainty estimates. We discuss the theoretical foundations of our method and compare it with existing uncertainty quantification techniques. The results highlight the effectiveness of the proposed method in producing well-calibrated uncertainty estimates, offering a practical solution for uncertainty-aware deep learning models.
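As a rough illustration of the idea in this abstract, the sketch below keeps dropout active at prediction time and summarises the resulting class evidence with a Dirichlet distribution, using the quantities from Sensoy et al. (2018): alpha_k = e_k + 1, S = sum of alpha, expected probability alpha_k / S, and uncertainty K / S. The one-layer evidence head `toy_evidence_pass`, the helper names, and the choice to average evidence across passes are assumptions made for illustration; the abstract does not say how the authors combine the stochastic passes.

```python
import random


def dirichlet_summary(evidence):
    """Map non-negative class evidence to Dirichlet-based quantities.

    Uses the evidential formulation of Sensoy et al. (2018):
    alpha_k = e_k + 1, S = sum(alpha), p_k = alpha_k / S, u = K / S.
    """
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength
    return probs, uncertainty


def toy_evidence_pass(features, weights, drop_rate, rng):
    """One stochastic forward pass of a hypothetical one-layer evidence head.

    Dropout stays active at inference time (the MC Dropout idea), and a
    ReLU keeps the evidence non-negative.
    """
    keep = 1.0 - drop_rate
    # Inverted dropout: kept features are rescaled by 1 / keep.
    masked = [x / keep if rng.random() < keep else 0.0 for x in features]
    return [max(0.0, sum(w * x for w, x in zip(row, masked))) for row in weights]


def mc_dropout_dirichlet(features, weights, drop_rate=0.5, passes=50, seed=0):
    """Average evidence over several stochastic passes, then summarise it.

    Averaging the evidence is an assumed aggregation rule, not one taken
    from the paper.
    """
    rng = random.Random(seed)
    samples = [
        toy_evidence_pass(features, weights, drop_rate, rng) for _ in range(passes)
    ]
    n_classes = len(weights)
    mean_evidence = [sum(s[k] for s in samples) / passes for k in range(n_classes)]
    return dirichlet_summary(mean_evidence)
```

For example, `dirichlet_summary([9.0, 0.0, 0.0])` gives an uncertainty of 3 / 12 = 0.25, while zero evidence for every class gives the maximum uncertainty of 1.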


Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

arXiv.org Machine Learning

We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a rank-flow tie-breaker then selects front-loaded schedules, substantially reducing held-out test loss at no extra computational cost, with accuracy gains as a consistent secondary effect. We test the predictions in MLPs and Vision Transformers and discuss CNN/ResNet extensions.


When Does Gene Regulatory Network Inference Break? A Controlled Diagnostic Study of Causal and Correlational Methods on Single-Cell Data

arXiv.org Machine Learning

Despite theoretical advantages, causal methods for Gene Regulatory Network (GRN) inference from single-cell RNA-seq data consistently fail to match or outperform correlation-based baselines in many realistic benchmarks, a persistent puzzle which casts doubt on the value of causality for this task. We argue that existing benchmarks are insufficiently controlled to answer this question because they evaluate on real or semi-real data where multiple pathologies co-occur, confounding failure modes, and obscuring the specific conditions under which different inference methods excel or fail. To address this gap, we introduce a controlled diagnostic framework that isolates seven biologically motivated pathologies (dropout, latent confounders, cell-type mixing, feedback loops, network density, sample size, and pseudotime drift) and measure how six representative methods spanning three inference paradigms degrade as each pathology intensifies. Across 6,120 controlled experiments, we find that causal methods genuinely dominate in clean and structurally favorable regimes, but specific pathologies (notably dropout and latent confounders) selectively neutralize their advantages. We further introduce an error-type decomposition that reveals methods with similar aggregate accuracy commit qualitatively different errors. To probe whether single-pathology effects persist when multiple stressors co-occur, we perform an interaction sweep over the three most impactful pathologies and find that their joint effects are sub-additive, while also exposing density-conditional cross-overs invisible to single-dial analysis. Our findings offer a nuanced understanding of when and why different methods succeed or fail for GRN inference, providing actionable insights for method development and practical guidance for practitioners.



White-Box Transformers via Sparse Rate Reduction

Neural Information Processing Systems

In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT.
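The "lossy coding rate" that the attention step is said to minimize can be made concrete with a small sketch. It assumes the form commonly used in the rate reduction literature, R(Z) = 1/2 logdet(I + d / (n eps^2) Z Z^T) for n tokens of dimension d; the exact form and the value of eps used in this paper are not given in the abstract, so both are assumptions here, as are the function names. The log-determinant is computed with a plain Cholesky factorization to keep the example dependency-free.

```python
import math


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    # det(A) = prod(L_ii)^2, so log det(A) = 2 * sum(log L_ii).
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))


def coding_rate(tokens, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T).

    `tokens` is a list of n token vectors (the columns of Z), each of
    dimension d. The default eps is an arbitrary illustrative value.
    """
    n = len(tokens)
    d = len(tokens[0])
    scale = d / (n * eps ** 2)
    gram = [
        [(1.0 if i == j else 0.0) + scale * sum(z[i] * z[j] for z in tokens)
         for j in range(d)]
        for i in range(d)
    ]
    return 0.5 * log_det_spd(gram)
```

Tokens that spread over several directions have a higher rate than tokens collapsed onto one direction: with `eps=1.0`, `coding_rate([[1.0, 0.0], [0.0, 1.0]])` exceeds `coding_rate([[1.0, 0.0], [1.0, 0.0]])`, which is the sense in which compression lowers the coding rate.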



1f9f9d8ff75205aa73ec83e543d8b571-Supplemental.pdf

Neural Information Processing Systems

We repeat the theorems presented in Sec. 3 and provide their proofs below. The theorems hold for Neumann boundary conditions, which we use in our implementation; this is achieved by the construction of the differential operators. The proofs follow the ones presented in [22]. If the activation function $\sigma(\cdot)$ is monotonically non-decreasing and sign-preserving, then the forward propagation through the diffusive PDE in (1) for $t \in [0, \infty)$ yields a non-increasing feature norm, that is, $\frac{\partial}{\partial t} \|f\|^2 \le 0$. Proof. Let us examine the following inner product following Eq.


Learning Conjoint Attentions for Graph Neural Nets Supplementary Materials

Neural Information Processing Systems

To prove Theorem 1, we need to consider the two directions of the iff conditions. If we are given h(c1,X1) = h(c2,X2), we are able to prove that the conditions mentioned in the theorem are necessary by showing contradictions occur when they are not satisfied. As Eq. (4) equals Eq. (6), we have: X Obviously, the above equation does not hold as the terms in the summation operator are positive. We may now assume S1 = S2 = S. Eliminating the irrational terms in Eq. (4), we have: X Eq. (9) can be simplified and rewritten as: µ1(x) µ2(x) = However, the RHS of Eq. (10) can be an irrational number. It is obvious that the above equality does not hold as the RHS is an irrational number, while LHS is a rational number.


Diffused Redundancy

Neural Information Processing Systems

A.1 CKA Definition. In all our evaluations we use CKA with a linear kernel [24], which essentially amounts to the following steps: A.2 Additional CKA results. Fig. 9 shows a CKA comparison between randomly chosen parts of the layer and the full layer for different kinds of ResNet50. We observe that even the ResNet50 trained with MRL loss shows a significant amount of diffused redundancy. Figure 9: [Comparison of Diffused Redundancy in MRL vs other losses, through the lens of CKA] We see a similar trend as reported in Fig. 7 in the main paper, where even the MRL model shows a significant amount of diffused redundancy despite being explicitly trained to instead have structured redundancy. The amount of diffused redundancy, however, is much lower than in the ResNets trained using the standard loss and adv. Here we list the sources of weights for the various pre-trained models used in our experiments: ResNet18 trained on ImageNet1k using standard loss: taken from timm v0.6.1.
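For reference, here is a minimal sketch of linear CKA as it is commonly defined: center each feature column, then take the ratio ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F). Whether this matches the authors' exact steps is an assumption. The helper `random_subset_cka` and its `fraction` and `seed` parameters are hypothetical names that illustrate comparing a randomly chosen part of a layer with the full layer.

```python
import random


def center_columns(rows):
    """Subtract the per-column mean from a row-major matrix (n samples x p features)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two representations of the same n samples."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(y, x)
    denominator = (cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y)) ** 0.5
    return numerator / denominator


def random_subset_cka(features, fraction, seed=0):
    """CKA between a random subset of neurons (columns) and the full layer."""
    rng = random.Random(seed)
    width = len(features[0])
    k = max(1, round(fraction * width))
    idx = sorted(rng.sample(range(width), k))
    subset = [[row[i] for i in idx] for row in features]
    return linear_cka(subset, features)
```

The score is 1 when a representation is compared with itself or with a uniformly rescaled copy, and it is undefined (division by zero) if every feature column is constant, a case this sketch does not handle.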


Diffused Redundancy in Pre-trained Representations

Neural Information Processing Systems

Representations learned by pre-training a neural network on a large dataset are increasingly used successfully to perform a variety of downstream tasks. In this work, we take a closer look at how features are encoded in such pre-trained representations. We find that learned representations in a given layer exhibit a degree of diffuse redundancy, i.e., any randomly chosen subset of neurons in the layer that is larger than a threshold size shares a large degree of similarity with the full layer and is able to perform similarly as the whole layer on a variety of downstream tasks. For example, a linear probe trained on 20% of randomly picked neurons from the penultimate layer of a ResNet50 pre-trained on ImageNet1k achieves an accuracy within 5% of a linear probe trained on the full layer of neurons for downstream CIFAR10 classification. We conduct experiments on different neural architectures (including CNNs and Transformers) pretrained on both ImageNet1k and ImageNet21k and evaluate a variety of downstream tasks taken from the VTAB benchmark. We find that the loss & dataset used during pre-training largely govern the degree of diffuse redundancy and the "critical mass" of neurons needed often depends on the downstream task, suggesting that there is a task-inherent redundancy-performance Pareto frontier. Our findings shed light on the nature of representations learned by pre-trained deep neural networks and suggest that entire layers might not be necessary to perform many downstream tasks. We investigate the potential for exploiting this redundancy to achieve efficient generalization for downstream tasks and also draw caution to certain possible unintended consequences.