Goto

Collaborating Authors

 Perceptrons


Review for NeurIPS paper: Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains

Neural Information Processing Systems

Using NTK theory the authors show that a standard multilayer perceptron fails to learn high frequencies both in theory and in practice. The authors then use a Fourier feature mapping to transform to overcome this bias. The experimental results also demonstrate that, with the same amount of training points (e,g. All reviewers thought the paper contains a very rich set of experiments and interesting numerical results. The reviewers raised various technical concerns in their reviews but thought that the authors' response adequately addressed these concerns and multiple reviewers raised their score.


MLPs at the EOC: Concentration of the NTK

arXiv.org Machine Learning

We study the concentration of the Neural Tangent Kernel (NTK) $K_\theta : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ of $l$-layer Multilayer Perceptrons (MLPs) $N : \mathbb{R}^{m_0} \times \Theta \to \mathbb{R}^{m_l}$ equipped with activation functions $\phi(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ with the parameter $\theta \in \Theta$ being initialized at the Edge Of Chaos (EOC). Without relying on the gradient independence assumption that has only been shown to hold asymptotically in the infinitely wide limit, we prove that an approximate version of gradient independence holds at finite width. Showing that the NTK entries $K_\theta(x_{i_1},x_{i_2})$ for $i_1,i_2 \in [1:n]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$ concentrate simultaneously via maximal inequalities, we prove that the NTK matrix $K(\theta) = [\frac{1}{n} K_\theta(x_{i_1},x_{i_2}) : i_1,i_2 \in [1:n]] \in \mathbb{R}^{nm_l \times nm_l}$ concentrates around its infinitely wide limit $\overset{\scriptscriptstyle\infty}{K} \in \mathbb{R}^{nm_l \times nm_l}$ without the need for linear overparameterization. Our results imply that in order to accurately approximate the limit, hidden layer widths have to grow quadratically as $m_k = k^2 m$ for some $m \in \mathbb{N}+1$ for sufficient concentration. For such MLPs, we obtain the concentration bound $\mathbb{P}( \Vert K(\theta) - \overset{\scriptscriptstyle\infty}{K} \Vert \leq O((\Delta_\phi^{-2} + m_l^{\frac{1}{2}} l) \kappa_\phi^2 m^{-\frac{1}{2}})) \geq 1-O(m^{-1})$ modulo logarithmic terms, where we denoted $\Delta_\phi = \frac{b^2}{a^2+b^2}$ and $\kappa_\phi = \frac{\vert a \vert + \vert b \vert}{\sqrt{a^2 + b^2}}$. This reveals in particular that the absolute value ($\Delta_\phi=1$, $\kappa_\phi=1$) beats the ReLU ($\Delta_\phi=\frac{1}{2}$, $\kappa_\phi=\sqrt{2}$) in terms of the concentration of the NTK.


Local Control Networks (LCNs): Optimizing Flexibility in Neural Network Data Pattern Capture

arXiv.org Artificial Intelligence

The widespread use of Multi-layer perceptrons (MLPs) often relies on a fixed activation function (e.g., ReLU, Sigmoid, Tanh) for all nodes within the hidden layers. While effective in many scenarios, this uniformity may limit the networks ability to capture complex data patterns. We argue that employing the same activation function at every node is suboptimal and propose leveraging different activation functions at each node to increase flexibility and adaptability. To achieve this, we introduce Local Control Networks (LCNs), which leverage B-spline functions to enable distinct activation curves at each node. Our mathematical analysis demonstrates the properties and benefits of LCNs over conventional MLPs. In addition, we demonstrate that more complex architectures, such as Kolmogorov-Arnold Networks (KANs), are unnecessary in certain scenarios, and LCNs can be a more efficient alternative. Empirical experiments on various benchmarks and datasets validate our theoretical findings. In computer vision tasks, LCNs achieve marginal improvements over MLPs and outperform KANs by approximately 5\%, while also being more computationally efficient than KANs. In basic machine learning tasks, LCNs show a 1\% improvement over MLPs and a 0.6\% improvement over KANs. For symbolic formula representation tasks, LCNs perform on par with KANs, with both architectures outperforming MLPs. Our findings suggest that diverse activations at the node level can lead to improved performance and efficiency.


MLPs at the EOC: Spectrum of the NTK

arXiv.org Artificial Intelligence

We study the properties of the Neural Tangent Kernel (NTK) $\overset{\scriptscriptstyle\infty}{K} : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ corresponding to infinitely wide $l$-layer Multilayer Perceptrons (MLPs) taking inputs from $\mathbb{R}^{m_0}$ to outputs in $\mathbb{R}^{m_l}$ equipped with activation functions $\phi(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ and initialized at the Edge Of Chaos (EOC). We find that the entries $\overset{\scriptscriptstyle\infty}{K}(x_1,x_2)$ can be approximated by the inverses of the cosine distances of the activations corresponding to $x_1$ and $x_2$ increasingly better as the depth $l$ increases. By quantifying these inverse cosine distances and the spectrum of the matrix containing them, we obtain tight spectral bounds for the NTK matrix $\overset{\scriptscriptstyle\infty}{K} = [\frac{1}{n} \overset{\scriptscriptstyle\infty}{K}(x_{i_1},x_{i_2}) : i_1, i_2 \in [1:n]]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$, transferred from the inverse cosine distance matrix via our approximation result. Our results show that $\Delta_\phi = \frac{b^2}{a^2+b^2}$ determines the rate at which the condition number of the NTK matrix converges to its limit as depth increases, implying in particular that the absolute value ($\Delta_\phi=1$) is better than the ReLU ($\Delta_\phi=\frac{1}{2}$) in this regard.


Reviews: Quantum Perceptron Models

Neural Information Processing Systems

In my opinion, this paper deserves the attention of the NIPS audience at least on poster level for sound investigation of an highly interdisciplinary and fundamental question whose answer might eventually have a strong impact on ML. To the best of my knowledge, the proofs and arguments presented are correct. While an experimental evaluation would of course have been very good to back up the theoretical claims, the absence of a device capable of executing the algorithm excuses the authors from my point of view. As far as quantum algorithms go, the results are hence well presented and analyzed. It is clear that this paper stands out in terms of originality and novelty (at least for the NIPS audience).


Frequency-domain MLPs are More Effective Learners in Time Series Forecasting

Neural Information Processing Systems

Time series forecasting has played the key role in different industrial, including finance, traffic, energy, and healthcare domains. While existing literatures have designed many sophisticated architectures based on RNNs, GNNs, or Transformers, another kind of approaches based on multi-layer perceptrons (MLPs) are proposed with simple structure, low complexity, and superior performance. However, most MLP-based forecasting methods suffer from the point-wise mappings and information bottleneck, which largely hinders the forecasting performance. To overcome this problem, we explore a novel direction of applying MLPs in the frequency domain for time series forecasting. We investigate the learned patterns of frequency-domain MLPs and discover their two inherent characteristic benefiting forecasting, (i) global view: frequency spectrum makes MLPs own a complete view for signals and learn global dependencies more easily, and (ii) energy compaction: frequency-domain MLPs concentrate on smaller key part of frequency components with compact signal energy.


A Machine Learning Framework for Handling Unreliable Absence Label and Class Imbalance for Marine Stinger Beaching Prediction

arXiv.org Machine Learning

Bluebottles (\textit{Physalia} spp.) are marine stingers resembling jellyfish, whose presence on Australian beaches poses a significant public risk due to their venomous nature. Understanding the environmental factors driving bluebottles ashore is crucial for mitigating their impact, and machine learning tools are to date relatively unexplored. We use bluebottle marine stinger presence/absence data from beaches in Eastern Sydney, Australia, and compare machine learning models (Multilayer Perceptron, Random Forest, and XGBoost) to identify factors influencing their presence. We address challenges such as class imbalance, class overlap, and unreliable absence data by employing data augmentation techniques, including the Synthetic Minority Oversampling Technique (SMOTE), Random Undersampling, and Synthetic Negative Approach that excludes the negative class. Our results show that SMOTE failed to resolve class overlap, but the presence-focused approach effectively handled imbalance, class overlap, and ambiguous absence data. The data attributes such as the wind direction, which is a circular variable, emerged as a key factor influencing bluebottle presence, confirming previous inference studies. However, in the absence of population dynamics, biological behaviours, and life cycles, the best predictive model appears to be Random Forests combined with Synthetic Negative Approach. This research contributes to mitigating the risks posed by bluebottles to beachgoers and provides insights into handling class overlap and unreliable negative class in environmental modelling.


Physics-Informed Machine Learning for Efficient Reconfigurable Intelligent Surface Design

arXiv.org Machine Learning

Reconfigurable intelligent surface (RIS) is a two-dimensional periodic structure integrated with a large number of reflective elements, which can manipulate electromagnetic waves in a digital way, offering great potentials for wireless communication and radar detection applications. However, conventional RIS designs highly rely on extensive full-wave EM simulations that are extremely time-consuming. To address this challenge, we propose a machine-learning-assisted approach for efficient RIS design. An accurate and fast model to predict the reflection coefficient of RIS element is developed by combining a multi-layer perceptron neural network (MLP) and a dual-port network, which can significantly reduce tedious EM simulations in the network training. A RIS has been practically designed based on the proposed method. To verify the proposed method, the RIS has also been fabricated and measured. The experimental results are in good agreement with the simulation results, which validates the efficacy of the proposed method in RIS design.


Scaling MLPs: A Tale of Inductive Bias

Neural Information Processing Systems

In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. To that end, MLPs offer an ideal test bed, as they lack any vision-specific inductive bias. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: \textit{Do MLPs reflect the empirical advances exhibited by practical models?}


Wide Neural Networks as Gaussian Processes: Lessons from Deep Equilibrium Models

Neural Information Processing Systems

Neural networks with wide layers have attracted significant attention due to their equivalence to Gaussian processes, enabling perfect fitting of training data while maintaining generalization performance, known as benign overfitting. However, existing results mainly focus on shallow or finite-depth networks, necessitating a comprehensive analysis of wide neural networks with infinite-depth layers, such as neural ordinary differential equations (ODEs) and deep equilibrium models (DEQs). In this paper, we specifically investigate the deep equilibrium model (DEQ), an infinite-depth neural network with shared weight matrices across layers. Our analysis reveals that as the width of DEQ layers approaches infinity, it converges to a Gaussian process, establishing what is known as the Neural Network and Gaussian Process (NNGP) correspondence. Remarkably, this convergence holds even when the limits of depth and width are interchanged, which is not observed in typical infinite-depth Multilayer Perceptron (MLP) networks.