Appendix

Neural Information Processing Systems

We held out a validation set from the training set, and used this validation set to select the L2 regularization hyperparameter, which we selected from 45 logarithmically spaced values between $10^{-6}$ and $10^{5}$, applied to the sum of the per-example losses. Because the optimization problem is convex, we used the previous weights as a warm start as we increased the L2 regularization hyperparameter. We measured either top-1 or mean per-class accuracy, depending on which was suggested by the dataset creators. A.3 Fine-tuning In our fine-tuning experiments in Table 2, we used standard ImageNet-style data augmentation and trained for 20,000 steps with SGD with momentum of 0.9 and cosine annealing [20] without restarts. Each curve represents a different model.
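The two numeric recipes in this snippet can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the grid bounds are 10^-6 and 10^5 (which gives exactly four values per decade plus one), and `base_lr` is a hypothetical value because the snippet does not state the learning rate.

```python
import math
import numpy as np

# 45 logarithmically spaced L2 strengths. The 1e-6..1e5 range is an assumption
# about the snippet's bounds: 11 decades * 4 values per decade + 1 = 45.
l2_grid = np.logspace(-6, 5, num=45)


def cosine_annealed_lr(step, total_steps=20_000, base_lr=0.01):
    """Cosine annealing without restarts.

    base_lr is a hypothetical value; the snippet only gives the step count.
    """
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

Sweeping `l2_grid` in increasing order and reusing the previous solution as the starting point is what the snippet calls a warm start. Because the objective is convex, the starting point affects only how quickly the optimizer converges, not which solution it reaches.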


Supplementary Contents

Neural Information Processing Systems

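The authors T admits a singular value decomposition $(\sigma_i, \phi_i, \psi_i)_{i \in I}$ for some $I$, with $1 = \sigma_0 \ge \sigma_1 \ge \dots$, $\phi_i : X \to \mathbb{R}$ and $\psi_i : Z \to \mathbb{R}$, i.e. $T\phi_i = \sigma_i \psi_i$ and $T^{*}\psi_i = \sigma_i \phi_i$. Moreo T operator as: Th = X Figure 5: Estimated rbf kernel with =.1 and 1000 samples. Then the estimator presented in Equation (4) satisfies that w.p. $1-\delta$: $\|T(\hat{h} - h_0)\|_2 \le O\left(\sqrt{\frac{rs\log(pn)}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$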


TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces

arXiv.org Machine Learning

Foundation models are large-scale neural networks pre-trained on diverse data to learn general-purpose representations that can be fine-tuned for specific downstream tasks. This poses significant challenges, especially in the case of low-labelled data, a semi-supervised learning setting where only a small fraction of the data samples are labelled, while the majority remain unlabelled. While foundation models are pre-trained on large datasets in a self-supervised manner, their deployment often requires fine-tuning on new datasets with limited labelled samples and potential distribution shifts. Furthermore, the downstream tasks frequently differ from the pre-training objectives, complicating the adaptation process. Existing semi-supervised approaches, such as pseudo-labelling, rely heavily on assumptions about data distributions or task-specific tuning, limiting their generalisability. Addressing these challenges is essential to fully exploit the potential of foundation models and ensure their adaptability and scalability in diverse applications. The main contributions of this study are: A new framework for fine-tuning foundation models: We introduce a fine-tuning strategy based on mutual information decomposition.


Automated Design of Linear Bounding Functions for Sigmoidal Nonlinearities in Neural Networks

arXiv.org Artificial Intelligence

The ubiquity of deep learning algorithms in various applications has amplified the need for assuring their robustness against small input perturbations such as those occurring in adversarial attacks. Existing complete verification techniques offer provable guarantees for all robustness queries but struggle to scale beyond small neural networks. To overcome this computational intractability, incomplete verification methods often rely on convex relaxation to over-approximate the nonlinearities in neural networks. Progress in tighter approximations has been achieved for piecewise linear functions. However, robustness verification of neural networks with general activation functions (e.g., Sigmoid, Tanh) remains under-explored and poses new challenges. Typically, these networks are verified using convex relaxation techniques, which involve computing linear upper and lower bounds of the nonlinear activation functions. In this work, we propose a novel parameter search method to improve the quality of these linear approximations. Specifically, we show that using a simple search method, carefully adapted to the given verification problem through state-of-the-art algorithm configuration techniques, improves the global lower bound by 25% on average over the current state of the art on several commonly used local robustness verification benchmarks.
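To make "linear upper and lower bounds" concrete, the sketch below computes one simple sound relaxation of the sigmoid on an input interval [l, u]. It is a generic baseline, not the search method this abstract proposes: both bounding lines use the smaller of the two endpoint derivatives as their slope. That is valid because the sigmoid's derivative is unimodal, so its minimum over the interval is attained at an endpoint. The function names are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_linear_bounds(l, u):
    """Return ((a_lo, b_lo), (a_up, b_up)) such that, for all x in [l, u],
    a_lo * x + b_lo <= sigmoid(x) <= a_up * x + b_up.
    """
    if not l < u:
        raise ValueError("expected l < u")
    slope_l = sigmoid(l) * (1.0 - sigmoid(l))  # derivative at l
    slope_u = sigmoid(u) * (1.0 - sigmoid(u))  # derivative at u
    k = min(slope_l, slope_u)                  # no larger than the derivative anywhere on [l, u]
    lower = (k, sigmoid(l) - k * l)            # line through (l, sigmoid(l))
    upper = (k, sigmoid(u) - k * u)            # line through (u, sigmoid(u))
    return lower, upper
```

Tighter bounds exist, for example tangent or chord lines chosen per interval. Choosing which line to use is the kind of parameter that a search method can tune.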


Machine Learning Techniques with Fairness for Prediction of Completion of Drug and Alcohol Rehabilitation

arXiv.org Artificial Intelligence

The aim of this study is to predict whether a person will complete a drug and alcohol rehabilitation program and the number of times a person attends. The study is based on demographic data obtained from the Substance Abuse and Mental Health Services Administration (SAMHSA), drawn from both admissions and discharge data from drug and alcohol rehabilitation centers in Oklahoma. Demographic data is highly categorical, which led to binary encoding being used and various fairness measures being utilized to mitigate bias of nine demographic variables. Kernel methods such as linear, polynomial, sigmoid, and radial basis functions were compared using support vector machines at various parameter ranges to find the optimal values. These were then compared to methods such as decision trees, random forests, and neural networks. Synthetic Minority Oversampling Technique Nominal (SMOTEN) for categorical data was used to balance the data with imputation for missing data. The nine bias variables were then intersectionalized to mitigate bias and the dual and triple interactions were integrated to use the probabilities to look at worst case ratio fairness mitigation. Disparate Impact, Statistical Parity Difference, Conditional Statistical Parity Ratio, Demographic Parity, Demographic Parity Ratio, Equalized Odds, Equalized Odds Ratio, Equal Opportunity, and Equalized Opportunity Ratio were all explored in both the binary and multiclass scenarios.
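Two of the listed measures are easy to state in code. The sketch below uses the common textbook definitions for binary predictions and a single binary group attribute. The data and the function and parameter names are made up for illustration and do not come from the study.

```python
import numpy as np


def selection_rate(y_pred, mask):
    """Fraction of positive predictions within the masked group."""
    return float(np.mean(y_pred[mask]))


def disparate_impact(y_pred, group, unprivileged=1, privileged=0):
    """Ratio of selection rates, unprivileged over privileged.

    Undefined when the privileged group's selection rate is zero.
    """
    return selection_rate(y_pred, group == unprivileged) / selection_rate(
        y_pred, group == privileged
    )


def statistical_parity_difference(y_pred, group, unprivileged=1, privileged=0):
    """Difference of selection rates, unprivileged minus privileged."""
    return selection_rate(y_pred, group == unprivileged) - selection_rate(
        y_pred, group == privileged
    )


# Toy example: group 0 is selected 3 times out of 4, group 1 once out of 4.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
di = disparate_impact(y_pred, group)
spd = statistical_parity_difference(y_pred, group)
```

A disparate impact of 1 and a statistical parity difference of 0 both indicate equal selection rates across the two groups.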


Kernelized Concept Erasure

arXiv.org Artificial Intelligence

The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how those representations encode human-interpretable concepts is a fundamental problem. One prominent approach for the identification of concepts in neural representations is searching for a linear subspace whose erasure prevents the prediction of the concept from the representations. However, while many linear erasure algorithms are tractable and interpretable, neural networks do not necessarily represent concepts in a linear manner. To identify non-linearly encoded concepts, we propose a kernelization of a linear minimax game for concept erasure. We demonstrate that it is possible to prevent specific non-linear adversaries from predicting the concept. However, the protection does not transfer to different nonlinear adversaries. Therefore, exhaustively erasing a non-linearly encoded concept remains an open problem.


Mode-wise Principal Subspace Pursuit and Matrix Spiked Covariance Model

arXiv.org Artificial Intelligence

In modern scientific applications, data are often observed in the form of multiple matrices or tensors that pertain to different subjects from a certain population. For instance, longitudinal gene expression data consist of a matrix of gene expression levels across time for each subject (Liu et al., 2017); MRI imaging data contain one order-3 tensor image for each patient (Zhou et al., 2013); a multilayer network can be represented by an order-3 tensor, where each layer (i.e., a matrix) represents one network (Jing et al., 2021); an m-uniform hypergraph is typically viewed as an order-m tensor, whose entries denote all hyper-edges (Zhen & Wang, 2022); atomic-resolution 4D scanning transmission electron microscopy data can be expressed as an order-3 tensor with two modes denoting scan location and the other denoting the convergent beam electron diffraction pattern (Zhang et al., 2020). Combining information from all subjects results in a high-order tensor with subject independence along one mode and some covariance structure along the other modes that represent the relationship among the measured covariates. Principal Component Analysis (PCA) is a widely accepted method for analyzing data consisting of vectors associated with individual subjects. Its primary objective is to identify a lower-dimensional subspace within the feature domain that captures the majority of data variance (Pearson, 1901).


Dimension-Free Average Treatment Effect Inference with Deep Neural Networks

arXiv.org Machine Learning

This paper investigates the estimation and inference of the average treatment effect (ATE) using deep neural networks (DNNs) in the potential outcomes framework. Under some regularity conditions, the observed response can be formulated as the response of a mean regression problem with both the confounding variables and the treatment indicator as the independent variables. Using this formulation, we investigate two methods for ATE estimation and inference based on the estimated mean regression function via DNN regression using a specific network architecture. We show that both DNN estimates of ATE are consistent with dimension-free consistency rates under some assumptions on the underlying true mean regression model. Our model assumptions accommodate the potentially complicated dependence structure of the observed response on the covariates, including latent factors and nonlinear interactions between the treatment indicator and confounding variables. We also establish the asymptotic normality of our estimators based on the idea of sample splitting, ensuring precise inference and uncertainty quantification. Simulation studies and a real data application justify our theoretical findings and support our DNN estimation and inference methods.


Eliminating Multicollinearity Issues in Neural Network Ensembles: Incremental, Negatively Correlated, Optimal Convex Blending

arXiv.org Artificial Intelligence

Given a {features, target} dataset, we introduce an incremental algorithm that constructs an aggregate regressor, using an ensemble of neural networks. It is well known that ensemble methods suffer from the multicollinearity issue, which is the manifestation of redundancy arising mainly due to the common training dataset. In the present incremental approach, at each stage we optimally blend the aggregate regressor with a newly trained neural network under a convexity constraint which, if necessary, induces negative correlations. Under this framework, collinearity issues do not arise at all, thus rendering the method both accurate and robust.
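A single blending step under a convexity constraint has a closed form when the criterion is squared error. The sketch below assumes that criterion and works directly on prediction vectors. It illustrates the idea only and is not necessarily the algorithm of the paper, and the function name is hypothetical.

```python
import numpy as np


def optimal_convex_blend(aggregate_pred, new_pred, target):
    """Return beta in [0, 1] minimizing
    || target - ((1 - beta) * aggregate_pred + beta * new_pred) ||^2.

    The objective is a one-dimensional convex quadratic in beta, so the
    constrained minimizer is the unconstrained one clipped to [0, 1].
    """
    d = new_pred - aggregate_pred
    denom = float(d @ d)
    if denom == 0.0:  # identical predictions: any beta gives the same blend
        return 0.0
    beta = float((target - aggregate_pred) @ d) / denom
    return float(np.clip(beta, 0.0, 1.0))
```

Because beta = 0 recovers the current aggregate, the blended training error under this criterion can never be worse than the error before the step.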


Hebbian-Descent

arXiv.org Machine Learning

In this work we propose Hebbian-descent as a biologically plausible learning rule for hetero-associative as well as auto-associative learning in single layer artificial neural networks. It can be used as a replacement for gradient descent as well as Hebbian learning, in particular in online learning, as it inherits their advantages while not suffering from their disadvantages. We discuss the drawbacks of Hebbian learning as having problems with correlated input data and not profiting from seeing training patterns several times. For gradient descent we identify the derivative of the activation function as problematic, especially in online learning. Hebbian-descent addresses these problems by getting rid of the activation function's derivative and by centering, i.e. keeping the neural activities mean-free, leading to a biologically plausible update rule that is provably convergent, does not suffer from the vanishing error term problem, can deal with correlated data, profits from seeing patterns several times, and enables successful online learning when centering is used. We discuss its relationship to Hebbian learning, contrastive learning, and gradient descent and show that in case of a strictly positive derivative of the activation function Hebbian-descent leads to the same update rule as gradient descent but for a different loss function. In this case Hebbian-descent inherits the convergence properties of gradient descent, but we also show empirically that it converges when the derivative of the activation function is only non-negative, as for the step function. Furthermore, in case of the mean squared error loss Hebbian-descent can be understood as the difference between two Hebb-learning steps, which in case of an invertible and integrable activation function actually optimizes a generalized linear model. ...
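The two ingredients named in the abstract, dropping the activation derivative and centering the input, can be sketched as a single online update. The exact form below is an assumption read off the abstract rather than taken from the paper: the weight change is the centered input times the raw output error. The sigmoid activation, learning rate, zero mean `mu` and toy pattern are all illustrative.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def hebbian_descent_step(W, b, x, t, mu, lr=0.5):
    """One online update for a single-layer network; W has shape (n_in, n_out)."""
    x_centered = x - mu                      # centering: mean-free input
    h = sigmoid(x_centered @ W + b)          # output activities
    err = h - t                              # error is NOT scaled by the activation derivative
    W = W - lr * np.outer(x_centered, err)
    b = b - lr * err
    return W, b, h


# Toy usage: repeatedly present one pattern and watch the output approach the target.
x = np.array([1.0, 0.5, -0.5])
t = np.array([0.9, 0.1])
mu = np.zeros(3)                             # assumed input mean
W, b = np.zeros((3, 2)), np.zeros(2)
for _ in range(500):
    W, b, h = hebbian_descent_step(W, b, x, t, mu)
```

For a sigmoid output this update coincides with gradient descent on the cross-entropy loss. That matches the abstract's remark that, for a strictly positive activation derivative, the rule equals gradient descent on a different loss function.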