AITopics

2312.13385

Country:

Asia > Middle East > Israel (0.16)
North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.68)

Industry:

Media > Photography (0.46)
Energy > Oil & Gas (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.46)

arXiv.org Artificial IntelligenceMar-9-2023

Provable Data Subset Selection For Efficient Neural Network Training

Tukan, Murad, Zhou, Samson, Maalouf, Alaa, Rus, Daniela, Braverman, Vladimir, Feldman, Dan

Radial basis function neural networks (\emph{RBFNN}) are {well-known} for their capability to approximate any continuous function on a closed bounded set with arbitrary precision given enough hidden neurons. In this paper, we introduce the first algorithm to construct coresets for \emph{RBFNNs}, i.e., small weighted subsets that approximate the loss of the input data on any radial basis function network and thus approximate any function defined by an \emph{RBFNN} on the larger input data. In particular, we construct coresets for radial basis and Laplacian loss functions. We then use our coresets to obtain a provable data subset selection algorithm for training deep neural networks. Since our coresets approximate every function, they also approximate the gradient of each weight in a neural network, which is a particular function on the input. We then perform empirical evaluations on function approximation and dataset subset selection on popular network architectures and data sets, demonstrating the efficacy and accuracy of our coreset construction.

artificial intelligence, coreset, machine learning, (16 more...)

2303.05151

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

arXiv.org Artificial IntelligenceJul-23-2021

Compressing Neural Networks: Towards Determining the Optimal Layer-wise Decomposition

Liebenwein, Lucas, Maalouf, Alaa, Gal, Oren, Feldman, Dan, Rus, Daniela

We present a novel global compression framework for deep neural networks that automatically analyzes each layer to identify the optimal per-layer compression ratio, while simultaneously achieving the desired overall compression. Our algorithm hinges on the idea of compressing each convolutional (or fully-connected) layer by slicing its channels into multiple groups and decomposing each group via low-rank decomposition. At the core of our algorithm is the derivation of layer-wise error bounds from the Eckart Young Mirsky theorem. We then leverage these bounds to frame the compression problem as an optimization problem where we wish to minimize the maximum compression error across layers and propose an efficient algorithm towards a solution. Our experiments indicate that our method outperforms existing low-rank compression approaches across a wide range of networks and data sets. We believe that our results open up new avenues for future research into the global performance-size trade-offs of modern neural networks. Our code is available at https://github.com/lucaslie/torchprune.

compression ratio, deep learning, neural network, (19 more...)

2107.11442

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.34)

Industry: Government (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial IntelligenceNov-18-2020

Introduction to Core-sets: an Updated Survey

Feldman, Dan

In optimization or machine learning problems we are given a set of items, usually points in some metric space, and the goal is to minimize or maximize an objective function over some space of candidate solutions. For example, in clustering problems, the input is a set of points in some metric space, and a common goal is to compute a set of centers in some other space (points, lines) that will minimize the sum of distances to these points. In database queries, we may need to compute such a some for a specific query set of k centers. However, traditional algorithms cannot handle modern systems that require parallel real-time computations of infinite distributed streams from sensors such as GPS, audio or video that arrive to a cloud, or networks of weaker devices such as smartphones or robots. Core-set is a "small data" summarization of the input "big data", where every possible query has approximately the same answer on both data sets. Generic techniques enable efficient coreset maintenance of streaming, distributed and dynamic data. Traditional algorithms can then be applied on these coresets to maintain the approximated optimal solutions. The challenge is to design coresets with provable tradeoff between their size and approximation error. This survey summarizes such constructions in a retrospective way, that aims to unified and simplify the state-of-the-art. Bringing big data to the enterprise, 2012) are generated by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, RFID readers and wireless sensor networks (Segaran & Hammerbacher, 2009; Hellerstein, 2008; Funke & Laue, 2007). These require clustering algorithms that, unlike traditional algorithms, (a) learn unbounded streaming data that cannot fit into main memory, (b) run in parallel on distributed data among thousands of machines, (c) use low communication between the machines (d) apply real-time computations on the device, (e) handle privacy and security issues. A common approach is to reinvent computer science for handling these new computational models, and develop new algorithms "from scratch" independently of existing solutions.

artificial intelligence, coreset, optimization problem, (18 more...)

2011.09384

Country:

North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre:

Overview (0.66)
Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks > Sensor Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(2 more...)

arXiv.org Machine LearningOct-8-2020

Deep Learning Meets Projective Clustering

Maalouf, Alaa, Lang, Harry, Rus, Daniela, Feldman, Dan

A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$ approximation $A_j$ via SVD, and then factor $A_j$ into a pair of matrices that correspond to smaller fully-connected layers to replace the original embedding layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and the rows of $A_j$ represent their projections onto the $j$-dimensional subspace that minimizes the sum of squared distances ("errors") to the points. In practice, these rows of $A$ may be spread around $k>1$ subspaces, so factoring $A$ based on a single subspace may lead to large errors that turn into large drops in accuracy. Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces, each of dimension $j$, that minimizes the sum of squared distances over every point (row in $A$) to its \emph{closest} subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer by a set of $k$ small layers that operate in parallel and are then recombined with a single fully-connected layer. Extensive experimental results on the GLUE benchmark yield networks that are both more accurate and smaller compared to the standard matrix factorization (SVD). For example, we further compress DistilBERT by reducing the size of the embedding layer by $40\%$ while incurring only a $0.5\%$ average drop in accuracy over all nine GLUE tasks, compared to a $2.8\%$ drop using the existing SVD approach. On RoBERTa we achieve $43\%$ compression of the embedding layer with less than a $0.8\%$ average drop in accuracy as compared to a $3\%$ drop previously. Open code for reproducing and extending our results is provided.

deep learning, neural network, subspace, (19 more...)

2010.0429

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.82)

arXiv.org Artificial IntelligenceSep-11-2020

Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation

Tukan, Murad, Maalouf, Alaa, Weksler, Matan, Feldman, Dan

A common technique for compressing a neural network is to compute the $k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix $A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or embedding layer). Here, $d$ is the number of the neurons in the layer, $n$ is the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$. This $\ell_2$-approximation minimizes the sum over every entry to the power of $p=2$ in the matrix $A - A_{k,2}$, among every matrix $A_{k,2}\in\mathbb{R}^{n\times d}$ whose rank is $k$. While it can be computed efficiently via SVD, the $\ell_2$-approximation is known to be very sensitive to outliers ("far-away" rows). Hence, machine learning uses e.g. Lasso Regression, $\ell_1$-regularization, and $\ell_1$-SVM that use the $\ell_1$-norm. This paper suggests to replace the $k$-rank $\ell_2$ approximation by $\ell_p$, for $p\in [1,2]$. We then provide practical and provable approximation algorithms to compute it for any $p\geq1$, based on modern techniques in computational geometry. Extensive experimental results on the GLUE benchmark for compressing BERT, DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage. For example, our approach achieves $28\%$ compression of RoBERTa's embedding layer with only $0.63\%$ additive drop in the accuracy (without fine-tuning) in average over all tasks in GLUE, compared to $11\%$ drop using the existing $\ell_2$-approximation. Open code is provided for reproducing and extending our results.

artificial intelligence, matrix, neural network, (19 more...)

2009.05647

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

arXiv.org Machine LearningJun-18-2020

Coresets for Near-Convex Functions

Tukan, Murad, Maalouf, Alaa, Feldman, Dan

Coreset is usually a small weighted subset of $n$ input points in $\mathbb{R}^d$, that provably approximates their loss function for a given set of queries (models, classifiers, etc.). Coresets become increasingly common in machine learning since existing heuristics or inefficient algorithms may be improved by running them possibly many times on the small coreset that can be maintained for streaming distributed data. Coresets can be obtained by sensitivity (importance) sampling, where its size is proportional to the total sum of sensitivities. Unfortunately, computing the sensitivity of each point is problem dependent and may be harder to compute than the original optimization problem at hand. We suggest a generic framework for computing sensitivities (and thus coresets) for wide family of loss functions which we call near-convex functions. This is by suggesting the $f$-SVD factorization that generalizes the SVD factorization of matrices to functions. Example applications include coresets that are either new or significantly improves previous results, such as SVM, Logistic regression, M-estimators, and $\ell_z$-regression. Experimental results and open source are also provided.

artificial intelligence, optimization problem, sensitivity, (17 more...)

2006.05482

Country:

Europe > Spain (0.27)
Asia > Middle East > Israel (0.14)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)

arXiv.org Machine LearningJun-9-2020

Faster PAC Learning and Smaller Coresets via Smoothed Analysis

Maalouf, Alaa, Jubran, Ibrahim, Tukan, Murad, Feldman, Dan

PAC-learning usually aims to compute a small subset ($\varepsilon$-sample/net) from $n$ items, that provably approximates a given loss function for every query (model, classifier, hypothesis) from a given set of queries, up to an additive error $\varepsilon\in(0,1)$. Coresets generalize this idea to support multiplicative error $1\pm\varepsilon$. Inspired by smoothed analysis, we suggest a natural generalization: approximate the \emph{average} (instead of the worst-case) error over the queries, in the hope of getting smaller subsets. The dependency between errors of different queries implies that we may no longer apply the Chernoff-Hoeffding inequality for a fixed query, and then use the VC-dimension or union bound. This paper provides deterministic and randomized algorithms for computing such coresets and $\varepsilon$-samples of size independent of $n$, for any finite set of queries and loss function. Example applications include new and improved coreset constructions for e.g. streaming vector summarization [ICML'17] and $k$-PCA [NIPS'16]. Experimental results with open source code are provided.

artificial intelligence, construction, machine learning, (14 more...)

2006.05441

Country:

Asia > Middle East > Israel (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (1.00)

arXiv.org Machine LearningMar-9-2020

Sets Clustering

Jubran, Ibrahim, Tukan, Murad, Maalouf, Alaa, Feldman, Dan

The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result easily generalized for any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.

algorithm, artificial intelligence, survey article, (21 more...)

2003.04135

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

arXiv.org Machine LearningOct-11-2019

SiPPing Neural Networks: Sensitivity-informed Provable Pruning of Neural Networks

Baykal, Cenk, Liebenwein, Lucas, Gilitschenski, Igor, Feldman, Dan, Rus, Daniela

We introduce a pruning algorithm that provably sparsifies the parameters of a trained model in a way that approximately preserves the model's predictive accuracy. Our algorithm uses a small batch of input points to construct a data-informed importance sampling distribution over the network's parameters, and adaptively mixes a sampling-based and deterministic pruning procedure to discard redundant weights. Our pruning method is simultaneously computationally efficient, provably accurate, and broadly applicable to various network architectures and data distributions. Our empirical comparisons show that our algorithm reliably generates highly compressed networks that incur minimal loss in performance relative to that of the original network. We present experimental results that demonstrate our algorithm's potential to unearth essential network connections that can be trained successfully in isolation, which may be of independent interest.

deep learning, neural network, null, (16 more...)

1910.05422

Genre: Research Report (1.00)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)