Park, Sejun
IMPaCT GNN: Imposing invariance with Message Passing in Chronological split Temporal Graphs
Park, Sejun, Park, Joo Young, Park, Hyunwoo
This paper addresses domain adaptation challenges in graph data that arise from chronological splits. In a transductive graph learning setting, where each node is associated with a timestamp, we focus on the task of Semi-Supervised Node Classification (SSNC), aiming to classify recent nodes using the labels of past nodes. Temporal dependencies in node connections create domain shifts, causing significant performance degradation when models trained on historical data are applied to recent data. Given the practical relevance of this scenario, addressing domain adaptation under chronological splits is crucial, yet underexplored. We propose Imposing invariance with Message Passing in Chronological split Temporal Graphs (IMPaCT), a method that imposes invariant properties based on realistic assumptions derived from temporal graph structures. Unlike traditional domain adaptation approaches, which rely on unverifiable assumptions, IMPaCT explicitly accounts for the characteristics of chronological splits. IMPaCT is further supported by rigorous mathematical analysis, including a derivation of an upper bound on the generalization error. Experimentally, IMPaCT achieves a 3.8% performance improvement over the current state-of-the-art method on the ogbn-mag graph dataset. Additionally, we introduce the Temporal Stochastic Block Model (TSBM), which replicates temporal graphs under varying conditions, demonstrating the applicability of our methods to general spatial GNNs.
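To make the chronological-split setting concrete, here is a minimal illustrative sketch (not part of IMPaCT itself; all names and values are hypothetical) of how the nodes of a timestamped graph would be partitioned into past (labeled) and recent (unlabeled) sets for transductive SSNC:

```python
import numpy as np

def chronological_split(node_timestamps, cutoff):
    """Split node indices into past (train) and recent (test) sets by timestamp.

    node_timestamps: 1-D array with the timestamp of each node.
    cutoff: nodes with timestamp <= cutoff are treated as past (labeled).
    """
    node_timestamps = np.asarray(node_timestamps)
    train_idx = np.where(node_timestamps <= cutoff)[0]
    test_idx = np.where(node_timestamps > cutoff)[0]
    return train_idx, test_idx

# Toy example: six nodes observed over time; classify the two most recent nodes
# using labels of the four past nodes (transductive SSNC with a chronological split).
timestamps = [2015, 2016, 2017, 2018, 2019, 2020]
train_idx, test_idx = chronological_split(timestamps, cutoff=2018)
print(train_idx, test_idx)  # [0 1 2 3] [4 5]
```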
A Kernel Perspective on Distillation-based Collaborative Learning
Park, Sejun, Hong, Kihun, Hwang, Ganguk
Over the past decade, there has been growing interest in collaborative learning that can enhance the AI models of multiple parties. However, it remains challenging to enhance their performance without sharing the private data and models of the individual parties. One recent promising approach is to develop distillation-based algorithms that exploit unlabeled public data, but the results are still unsatisfactory in both theory and practice. To tackle this problem, we rigorously analyze a representative distillation-based algorithm from the viewpoint of kernel regression. This work provides the first theoretical results proving the (nearly) minimax optimality of a nonparametric collaborative learning algorithm that does not directly share local data or models in massively distributed, statistically heterogeneous environments. Inspired by our theoretical results, we also propose a practical distillation-based collaborative learning algorithm based on neural network architectures. Our algorithm successfully bridges the gap between our theoretical assumptions and practical settings with neural networks through feature kernel matching. We simulate various regression tasks to verify our theory and demonstrate the practical feasibility of our proposed algorithm.
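As a rough sketch of the distillation-with-public-data idea viewed through kernel regression (an illustration under simplified assumptions, not the exact algorithm analyzed in the paper; the RBF kernel, regularization constants, and two-party setup are placeholders): each party fits a local kernel ridge regressor on its private data, only predictions on shared unlabeled public data are exchanged and averaged, and a global kernel ridge regressor is distilled from those pseudo-labels.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    # Returns dual coefficients alpha so that f(x) = k(x, X) @ alpha.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x[:, 0])

# Two parties hold disjoint private datasets (statistically heterogeneous supports).
X1 = rng.uniform(-1.0, 0.2, size=(40, 1)); y1 = f(X1) + 0.1 * rng.standard_normal(40)
X2 = rng.uniform(-0.2, 1.0, size=(40, 1)); y2 = f(X2) + 0.1 * rng.standard_normal(40)

# Shared unlabeled public data.
X_pub = rng.uniform(-1.0, 1.0, size=(100, 1))

# Each party fits a local kernel ridge regressor on its private data ...
a1 = kernel_ridge_fit(X1, y1)
a2 = kernel_ridge_fit(X2, y2)
# ... and only the predictions on the public data are shared and averaged.
y_pub = 0.5 * (kernel_ridge_predict(X1, a1, X_pub) + kernel_ridge_predict(X2, a2, X_pub))

# A global model is distilled by regressing on the public pseudo-labels.
a_glob = kernel_ridge_fit(X_pub, y_pub)
X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
print(kernel_ridge_predict(X_pub, a_glob, X_test))
```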
Expressive Power of ReLU and Step Networks under Floating-Point Operations
Park, Yeachan, Hwang, Geonho, Lee, Wonyeol, Park, Sejun
The study of the expressive power of neural networks investigates their fundamental limits. Most existing results assume real-valued inputs and parameters as well as exact operations during the evaluation of neural networks. However, neural networks are typically executed on computers that can only represent a tiny subset of the reals and apply inexact operations. In this work, we analyze the expressive power of neural networks under a more realistic setup: when we use floating-point numbers and operations. Our first set of results assumes floating-point operations where the significand of a float is represented by finite bits but its exponent can take any integer value. Under this setup, we show that neural networks using a binary threshold unit or ReLU can memorize any finite set of input/output pairs and can approximate any continuous function within a small error. We also show similar results on memorization and universal approximation when floating-point operations use finite bits for both the significand and the exponent; these results are applicable to many popular floating-point formats such as those defined in the IEEE 754 standard (e.g., the 32-bit single-precision format) and bfloat16.
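To make the memorization claim tangible, the following one-dimensional sketch (a generic construction for illustration, not the one used in the paper) builds a network of binary threshold units that reproduces finitely many input/output pairs using only floating-point comparisons and additions:

```python
import numpy as np

def step(t):
    # Binary threshold (Heaviside) unit: 1 if t >= 0, else 0.
    return (t >= 0).astype(float)

def memorize_1d(xs, ys):
    """Build f(x) = sum_i c_i * step(x - x_i) that matches ys on the sorted inputs xs."""
    order = np.argsort(xs)
    xs, ys = np.asarray(xs, float)[order], np.asarray(ys, float)[order]
    # c_i is the jump needed at x_i so that the running sum equals y_i.
    coeffs = np.diff(np.concatenate([[0.0], ys]))
    return lambda x: (coeffs * step(np.asarray(x, float)[:, None] - xs[None, :])).sum(-1)

xs = [0.3, -1.2, 2.5, 0.9]
ys = [1.0, -2.0, 0.5, 3.0]
f = memorize_1d(xs, ys)
print(f(np.array(xs)))  # recovers ys (up to floating-point rounding in the sums)
```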
Minimum width for universal approximation using ReLU networks on compact domain
Kim, Namjun, Min, Chanho, Park, Sejun
Understanding what neural networks can or cannot do is a fundamental problem in the study of their expressive power. Initial approaches to this problem mostly focused on depth-bounded networks. For example, one line of research studies the size of a two-layer neural network required to memorize (i.e., perfectly fit) an arbitrary training dataset and shows that a number of parameters proportional to the dataset size is necessary and sufficient for various activation functions (Baum, 1988; Huang and Babri, 1998). Another important line of work investigates the class of functions that can be approximated by two-layer networks. Classical results in this field, represented by the universal approximation theorem, show that two-layer networks using a non-polynomial activation function are dense in the space of continuous functions on compact domains (Hornik et al., 1989; Cybenko, 1989; Leshno et al., 1993; Pinkus, 1999). Along with the success of deep learning, the expressive power of deep neural networks has also been studied. As in the classical depth-bounded results, several works have shown that deep neural networks with bounded width can memorize an arbitrary training dataset (Yun et al., 2019; Vershynin, 2020) and can approximate any continuous function (Lu et al., 2017; Hanin and Sellke, 2017). Intriguingly, it has also been shown that deeper networks can be more expressive than shallow ones. For example, Telgarsky (2016), Eldan and Shamir (2016), and Daniely (2017) show that there is a class of functions that can be approximated by deep width-bounded networks with a small number of parameters but cannot be approximated by shallow networks unless their width is extremely large.
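As a worked illustration of the two-layer memorization results cited above (a textbook-style construction, not a contribution of this paper), a two-layer ReLU network with roughly one hidden unit per data point can perfectly fit any finite set of distinct one-dimensional inputs:

```python
import numpy as np

def fit_two_layer_relu(xs, ys):
    """Two-layer ReLU network f(x) = y_1 + sum_i a_i * relu(x - x_i)
    that interpolates N distinct 1-D points with N - 1 hidden units."""
    order = np.argsort(xs)
    xs, ys = np.asarray(xs, float)[order], np.asarray(ys, float)[order]
    slopes = np.diff(ys) / np.diff(xs)               # slope of each linear segment
    a = np.diff(np.concatenate([[0.0], slopes]))     # change of slope at each knot
    def f(x):
        hidden = np.maximum(np.asarray(x, float)[:, None] - xs[None, :-1], 0.0)  # ReLU layer
        return ys[0] + hidden @ a                     # linear output layer
    return f

xs = [0.0, 1.0, 2.0, 3.5]
ys = [1.0, -1.0, 4.0, 0.0]
f = fit_two_layer_relu(xs, ys)
print(f(np.array(xs)))  # recovers ys (up to floating-point rounding)
```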
On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters
Lee, Wonyeol, Park, Sejun, Aiken, Alex
Recent work has shown that forward- and reverse-mode automatic differentiation (AD) over the reals is almost always correct in a mathematically precise sense. However, actual programs work with machine-representable numbers (e.g., floating-point numbers), not reals. In this paper, we study the correctness of AD when the parameter space of a neural network consists solely of machine-representable numbers. In particular, we analyze two sets of parameters on which AD can be incorrect: the incorrect set, on which the network is differentiable but AD does not compute its derivative, and the non-differentiable set, on which the network is non-differentiable. For a neural network with bias parameters, we first prove that the incorrect set is always empty. We then prove a tight bound on the size of the non-differentiable set, which is linear in the number of non-differentiabilities in activation functions, and give a simple necessary and sufficient condition for a parameter to be in this set. We further prove that AD always computes a Clarke subderivative, even on the non-differentiable set. We also extend these results to neural networks possibly without bias parameters.
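A tiny concrete instance of the phenomenon studied here, assuming PyTorch is available: at the non-differentiable point $x = 0$ of ReLU, reverse-mode AD returns 0, which is an element of the Clarke subdifferential $[0, 1]$ at that point.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
y = torch.relu(x)      # ReLU is non-differentiable at x = 0
y.backward()
print(x.grad)          # tensor(0.) -- a valid Clarke subderivative (the subdifferential is [0, 1])

# At a differentiable point, AD agrees with the classical derivative.
z = torch.tensor(2.0, requires_grad=True)
torch.relu(z).backward()
print(z.grad)          # tensor(1.)
```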
Neural Networks Efficiently Learn Low-Dimensional Representations with SGD
Mousavi-Hosseini, Alireza, Park, Sejun, Girotti, Manuela, Mitliagkas, Ioannis, Erdogdu, Murat A.
We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input $\boldsymbol{x}\in \mathbb{R}^d$ is Gaussian and the target $y \in \mathbb{R}$ follows a multiple-index model, i.e., $y=g(\langle\boldsymbol{u_1},\boldsymbol{x}\rangle,...,\langle\boldsymbol{u_k},\boldsymbol{x}\rangle)$ with a noisy link function $g$. We prove that the first-layer weights of the NN converge to the $k$-dimensional principal subspace spanned by the vectors $\boldsymbol{u_1},...,\boldsymbol{u_k}$ of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when $k \ll d$. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of $O(\sqrt{{kd}/{T}})$ after $T$ iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ by recovering the principal direction, with a sample complexity linear in $d$ (up to log factors), where $f$ is a monotonic function with at most polynomial growth, and $\epsilon$ is the noise. This is in contrast to the known $d^{\Omega(p)}$ sample requirement for learning any degree-$p$ polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD.
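The following toy experiment is a hedged sketch of the phenomenon (simplified hyperparameters and target, not the paper's exact setting or assumptions): train a two-layer ReLU network with online SGD and weight decay on a single-index target and measure how much of the first-layer weight mass lies along the hidden direction $\boldsymbol{u}$.

```python
import torch

torch.manual_seed(0)
d, width, T, lr, wd = 50, 200, 20000, 0.05, 1e-3
u = torch.zeros(d); u[0] = 1.0                      # hidden single-index direction (unit norm)

model = torch.nn.Sequential(torch.nn.Linear(d, width), torch.nn.ReLU(),
                            torch.nn.Linear(width, 1))
opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd)

def alignment(W, u):
    # Fraction of first-layer weight mass lying along span{u} (equals 1.0 if all rows are parallel to u).
    return (torch.norm(W @ u) / torch.norm(W)).item()

for t in range(T):
    x = torch.randn(32, d)                           # fresh Gaussian batch (online SGD)
    y = torch.relu(x @ u) + 0.1 * torch.randn(32)    # y = f(<u, x>) + noise, with monotone f
    loss = ((model(x).squeeze(-1) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

W = model[0].weight.detach()
print(f"alignment with u: {alignment(W, u):.3f}")    # tends toward 1 as rows concentrate near span{u}
```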
Guiding Energy-based Models via Contrastive Latent Variables
Lee, Hankook, Jeong, Jongheon, Park, Sejun, Shin, Jinwoo
An energy-based model (EBM) is a popular generative framework that offers both an explicit density and architectural flexibility, but training EBMs is difficult since the process is often unstable and time-consuming. In recent years, various training techniques have been developed, e.g., better divergence measures or stabilization in MCMC sampling, but there often remains a large gap between EBMs and other generative frameworks like GANs in terms of generation quality. In this paper, we propose a novel and effective framework for improving EBMs via contrastive representation learning (CRL). To be specific, we consider representations learned by contrastive methods as the true underlying latent variable. This contrastive latent variable can guide EBMs to understand the data structure better, so it can improve and accelerate EBM training significantly. To enable the joint training of the EBM and CRL, we also design a new class of latent-variable EBMs for learning the joint density of data and the contrastive latent variable. Our experimental results demonstrate that our scheme achieves lower FID scores compared to prior EBM methods (e.g., those additionally using variational autoencoders or diffusion techniques), even with significantly faster and more memory-efficient training. We also show conditional and compositional generation abilities of our latent-variable EBMs as additional benefits, even without explicit conditional training. The code is available at https://github.com/hankook/CLEL.
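The sketch below conveys the high-level idea only (it is not the CLEL implementation; the architectures, the frozen placeholder encoder, and the short-run Langevin sampler are all simplifying assumptions): an energy network over (data, latent) pairs is trained with a contrastive-divergence-style objective, where the latent comes from a contrastive encoder.

```python
import torch
import torch.nn as nn

class JointEnergy(nn.Module):
    """Energy over (data, latent) pairs; a toy MLP stand-in for the actual architecture."""
    def __init__(self, x_dim=32, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + z_dim, 128), nn.SiLU(),
                                 nn.Linear(128, 1))
    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def langevin_negatives(energy, x_init, z, steps=20, step_size=0.01):
    # Short-run Langevin dynamics on x with the latent z held fixed.
    x = x_init.clone().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(energy(x, z).sum(), x)[0]
        x = (x - 0.5 * step_size * grad
             + (step_size ** 0.5) * torch.randn_like(x)).detach().requires_grad_(True)
    return x.detach()

x_dim, z_dim = 32, 16
energy = JointEnergy(x_dim, z_dim)
contrastive_encoder = nn.Linear(x_dim, z_dim)   # placeholder for a pretrained, frozen contrastive encoder
for p in contrastive_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(energy.parameters(), lr=1e-4)
for _ in range(100):
    x_pos = torch.randn(64, x_dim)              # stand-in for a data batch
    z = contrastive_encoder(x_pos)              # contrastive latent variable guiding the EBM
    x_neg = langevin_negatives(energy, torch.randn_like(x_pos), z)
    # Push energy down on data and up on sampled negatives (contrastive-divergence-style loss).
    loss = energy(x_pos, z).mean() - energy(x_neg, z).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```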
SmoothMix: Training Confidence-calibrated Smoothed Classifiers for Certified Robustness
Jeong, Jongheon, Park, Sejun, Kim, Minkyu, Lee, Heung-Chang, Kim, Doguk, Shin, Jinwoo
Under the randomized smoothing paradigm, the robustness of a classifier is aligned with its prediction confidence, i.e., higher confidence from a smoothed classifier implies better robustness. This motivates us to rethink the fundamental trade-off between accuracy and robustness in terms of calibrating the confidences of a smoothed classifier. In this paper, we propose a simple training scheme, coined SmoothMix, to control the robustness of smoothed classifiers via self-mixup: it trains on convex combinations of samples along the direction of adversarial perturbation for each input. The proposed procedure effectively identifies over-confident, near off-class samples as a cause of limited robustness in the case of smoothed classifiers, and offers an intuitive way to adaptively set a new decision boundary between these samples for better robustness.
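A simplified, hypothetical sketch of the self-mixup step described above (it omits the Gaussian noise averaging and other components of the actual SmoothMix procedure, and the label softening toward the uniform distribution is a stand-in for the confidence-calibration step): take an adversarial step from each input, form a convex combination of the clean and adversarial points, and soften the corresponding label.

```python
import torch
import torch.nn.functional as F

def self_mixup_batch(model, x, y, num_classes, eps=0.5, lam=0.5):
    """Return a mixed batch (x_mix, y_mix) from clean inputs and adversarial counterparts;
    a single-step illustrative sketch, not the full SmoothMix algorithm."""
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # One normalized gradient-ascent step of size eps in the adversarial direction.
    x_adv = (x + eps * grad / (grad.flatten(1).norm(dim=1).view(-1, 1) + 1e-12)).detach()

    x_mix = (1 - lam) * x + lam * x_adv                          # convex combination of clean and adversarial
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = (1 - lam) * y_onehot + lam * torch.full_like(y_onehot, 1.0 / num_classes)
    return x_mix, y_mix

# Usage with a toy linear classifier on flattened inputs.
model = torch.nn.Linear(20, 3)
x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))
x_mix, y_mix = self_mixup_batch(model, x, y, num_classes=3)
loss = torch.sum(-y_mix * F.log_softmax(model(x_mix), dim=1), dim=1).mean()
```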
Distribution Aligning Refinery of Pseudo-label for Imbalanced Semi-supervised Learning
Kim, Jaehyung, Hur, Youngbum, Park, Sejun, Yang, Eunho, Hwang, Sung Ju, Shin, Jinwoo
While semi-supervised learning (SSL) has proven to be a promising way to leverage unlabeled data when labeled data is scarce, existing SSL algorithms typically assume that training class distributions are balanced. However, SSL algorithms trained under imbalanced class distributions can severely suffer when generalizing to a balanced testing criterion, since they utilize pseudo-labels of unlabeled data that are biased toward majority classes. To alleviate this issue, we formulate a convex optimization problem to softly refine the pseudo-labels generated by the biased model, and develop a simple algorithm, named Distribution Aligning Refinery of Pseudo-label (DARP), that solves it provably and efficiently. Under various class-imbalanced semi-supervised scenarios, we demonstrate the effectiveness of DARP and its compatibility with state-of-the-art SSL schemes.
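The following sketch illustrates the distribution-alignment idea with simple alternating normalization; it is not the convex-optimization solver developed for DARP, and all quantities are synthetic. It refines a pseudo-label matrix so that its implied class distribution moves toward a balanced target while each row remains a probability vector.

```python
import numpy as np

def align_pseudo_labels(P, target_dist, n_iters=50):
    """Refine pseudo-labels P (N x K, rows summing to 1) so the average class probabilities
    approach target_dist, via Sinkhorn-style alternating normalization (illustrative only)."""
    Q = np.asarray(P, dtype=float).copy()
    target = np.asarray(target_dist, dtype=float)
    for _ in range(n_iters):
        Q *= target / (Q.mean(axis=0) + 1e-12)   # rescale columns toward the target class marginals
        Q /= Q.sum(axis=1, keepdims=True)        # keep each row a probability vector
    return Q

rng = np.random.default_rng(0)
# Pseudo-labels biased toward class 0 (the majority class in training).
P = rng.dirichlet([8.0, 1.0, 1.0], size=1000)
target = np.ones(3) / 3                          # balanced target distribution
Q = align_pseudo_labels(P, target)
print(P.mean(axis=0))  # heavily skewed toward class 0
print(Q.mean(axis=0))  # close to [1/3, 1/3, 1/3]
```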
Minimum Width for Universal Approximation
Park, Sejun, Yun, Chulhee, Lee, Jaeho, Shin, Jinwoo
The study of the expressive power of neural networks investigates what class of functions neural networks can or cannot represent or approximate. Classical results in this field mostly focus on shallow neural networks. An example of such results is the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989; Pinkus, 1999), which shows that a neural network with fixed depth and arbitrary width can approximate any continuous function on a compact set, up to arbitrary accuracy, if the activation function is continuous and nonpolynomial. Another line of research studies the memory capacity of neural networks (Baum, 1988; Huang and Babri, 1998; Huang, 2003), trying to characterize the maximum number of data points that a given neural network can memorize. After the advent of deep learning, researchers started to investigate the benefit of depth in the expressive power of neural networks, in an attempt to understand the success of deep neural networks. This has led to interesting results showing the existence of functions that shallow networks can approximate only with extremely large width, while deep and narrow networks approximate them easily (Telgarsky, 2016; Eldan and Shamir, 2016; Lin et al., 2017; Poggio et al., 2017). A similar tradeoff between depth and width in expressive power is observed in the study of the memory capacity of neural networks (Yun et al., 2019; Vershynin, 2020). In search of a deeper understanding of the role of depth in neural networks, a dual scenario of the classical universal approximation theorem has also been studied (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Kidger and Lyons, 2020).
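For reference, the classical universal approximation theorem mentioned above can be stated as follows (a standard paraphrase along the lines of Pinkus (1999), assuming an amsthm-style theorem environment):

```latex
% A standard statement of the classical universal approximation theorem
% (paraphrased; assumes an amsthm-style theorem environment).
\begin{theorem}[Universal approximation; cf. Cybenko, 1989; Hornik et al., 1989; Pinkus, 1999]
Let $\sigma:\mathbb{R}\to\mathbb{R}$ be continuous. The set of two-layer networks
\[
  \Big\{\, x \mapsto \sum_{i=1}^{N} c_i\,\sigma(\langle a_i, x\rangle + b_i)
  \;:\; N\in\mathbb{N},\; a_i\in\mathbb{R}^d,\; b_i, c_i\in\mathbb{R} \,\Big\}
\]
is dense in $C(K)$ with respect to the uniform norm for every compact set $K\subset\mathbb{R}^d$
if and only if $\sigma$ is not a polynomial.
\end{theorem}
```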