belkin
Feature maps for the Laplacian kernel and its generalizations
Ahir, Sudhendu, Pandit, Parthe
Recent applications of kernel methods in machine learning have seen a renewed interest in the Laplacian kernel, due to its stability to the bandwidth hyperparameter in comparison to the Gaussian kernel, as well as its expressivity being equivalent to that of the neural tangent kernel of deep fully connected networks. However, unlike the Gaussian kernel, the Laplacian kernel is not separable. This poses challenges for techniques to approximate it, especially via the random Fourier features (RFF) methodology and its variants. In this work, we provide random features for the Laplacian kernel and its two generalizations: the Matérn kernel and the exponential power kernel. We provide efficiently implementable schemes to sample weight matrices so that random features approximate these kernels. These weight matrices have a weakly coupled heavy-tailed randomness. Via numerical experiments on real datasets we demonstrate the efficacy of these random feature maps.
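For orientation, the Laplacian kernel with the Euclidean norm, exp(-||x - y||_2 / bandwidth), has a multivariate Cauchy spectral density, so a textbook RFF construction draws each weight row as a Gaussian vector divided by the absolute value of one Gaussian scalar shared across that row. The sketch below uses that standard construction; it is not necessarily the sampling scheme proposed in the paper, and the names `laplacian_rff` and `bandwidth` are illustrative assumptions.

```python
import numpy as np

def laplacian_rff(X, num_features, bandwidth, rng):
    """Random Fourier features for exp(-||x - y||_2 / bandwidth).

    Assumption: weights follow a multivariate Cauchy law, sampled as a
    Gaussian vector divided by |scalar Gaussian| (one scalar per row).
    """
    d = X.shape[1]
    g = rng.standard_normal((num_features, d))   # Gaussian directions
    u = rng.standard_normal((num_features, 1))   # one shared scalar per row
    W = g / (np.abs(u) * bandwidth)              # heavy-tailed weight rows
    b = rng.uniform(0.0, 2.0 * np.pi, num_features)  # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W.T + b)

# Usage: inner products of the features approximate the kernel matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Z = laplacian_rff(X, 20000, 2.0, rng)
approx_kernel = Z @ Z.T
```

Because the scalar divisor is shared within a row, the coordinates of each weight vector are dependent rather than independent, which is one simple way a non-separable kernel shows up in the sampling step.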
Fast training of large kernel models with delayed projections
Abedsoltan, Amirhesam, Ma, Siyuan, Pandit, Parthe, Belkin, Mikhail
Classical kernel machines have historically faced significant challenges in scaling to large datasets and model sizes--a key ingredient that has driven the success of neural networks. In this paper, we present a new methodology for building kernel machines that can scale efficiently with both data size and model size. Our algorithm introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD), allowing the training of much larger models than was previously feasible, pushing the practical limits of kernel-based learning. They have also served as the foundation for understanding many significant phenomena in [...] Despite these advantages, the scalability of kernel methods has remained a persistent challenge, particularly when applied to large datasets. [...] this limitation is critical for expanding the utility of kernel-based techniques in modern machine learning applications. [...] 2024) leverage the Nyström Approximation (NA) in combination with other strategies to enhance performance. ASkotch combines it with block coordinate descent, whereas Falkon combines it with the Conjugate Gradient [...] However, these strategies are limited by model size due to memory [...]
On the Nystrom Approximation for Preconditioning in Kernel Machines
Abedsoltan, Amirhesam, Belkin, Mikhail, Pandit, Parthe, Rademacher, Luis
Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed up the convergence of such iterative algorithms for training kernel models. However, computing and storing a spectral preconditioner can be expensive, which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nystrom-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
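The abstract does not spell out the construction, so the following is only a rough illustration of the ingredient a spectral preconditioner needs: the leading eigenvalues of the kernel matrix. The sketch compares the top eigenvalues of the full normalized kernel matrix with those of a random subsample (a Nystrom-style estimate). The kernel choice, the sizes `n`, `s`, `q`, and all function names are assumptions for illustration; the subsample size here is arbitrary and not the logarithmic size analyzed in the paper.

```python
import numpy as np

def laplacian_kernel(A, B, bandwidth):
    """Pairwise exp(-||a - b||_2 / bandwidth) between rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    dist = np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding
    return np.exp(-dist / bandwidth)

def top_eigenvalues(K, q):
    """Top-q eigenvalues of K divided by its size (normalized spectrum)."""
    vals = np.linalg.eigvalsh(K)          # ascending order
    return vals[::-1][:q] / K.shape[0]

rng = np.random.default_rng(0)
n, d, s, q = 1000, 5, 200, 3              # illustrative sizes only
X = rng.standard_normal((n, d))
idx = rng.choice(n, size=s, replace=False)

full = top_eigenvalues(laplacian_kernel(X, X, 5.0), q)
nystrom = top_eigenvalues(laplacian_kernel(X[idx], X[idx], 5.0), q)
rel_err = np.abs(full - nystrom) / full   # per-eigenvalue relative error
```

The subsample eigendecomposition costs O(s^3) instead of O(n^3), which is where the computational and storage savings mentioned in the abstract come from.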
Toward Large Kernel Models
Abedsoltan, Amirhesam, Belkin, Mikhail, Pandit, Parthe
Recent studies indicate that kernel machines can often perform similarly to or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural networks in certain regimes. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas in traditional kernel machines model size is tied to data size. Because of this coupling, scaling kernel machines to large data has been computationally challenging. In this paper, we provide a way forward for constructing large-scale general kernel models, which are a generalization of kernel machines that decouples the model and data, allowing training on large datasets. Specifically, we introduce EigenPro 3.0, an algorithm based on projected dual preconditioned SGD and show scaling to model and data sizes which have not been possible with existing kernel methods.
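To make the decoupling concrete, a general kernel model can be written as f(x) = sum_i alpha_i K(x, z_i), where the p centers z_i need not be training points. The sketch below fits such a model with plain least squares, which is a stand-in chosen for brevity and is not EigenPro 3.0; the kernel, the synthetic data, and every name in the code are assumptions for illustration.

```python
import numpy as np

def laplacian_kernel(A, B, bandwidth):
    """Pairwise exp(-||a - b||_2 / bandwidth) between rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / bandwidth)

def fit_general_kernel_model(X, y, centers, bandwidth):
    """Least-squares weights for f(x) = sum_i alpha_i K(x, z_i)."""
    features = laplacian_kernel(X, centers, bandwidth)   # shape (n, p)
    alpha, *_ = np.linalg.lstsq(features, y, rcond=None)
    return alpha

def predict(X, centers, alpha, bandwidth):
    return laplacian_kernel(X, centers, bandwidth) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(500, 2))        # n = 500 training points
y = np.sin(X[:, 0])                              # synthetic target
centers = rng.uniform(-3.0, 3.0, size=(40, 2))   # p = 40 centers, not from X
alpha = fit_general_kernel_model(X, y, centers, 1.0)
mse = np.mean((predict(X, centers, alpha, 1.0) - y) ** 2)
```

Here the model has 40 parameters regardless of how many training points are used, whereas a traditional kernel machine would carry one coefficient per training point.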
Cut your Losses with Squentropy
Hui, Like, Belkin, Mikhail, Wright, Stephen
Nearly all practical neural models for classification are trained using cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of the classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross-entropy and rescaled square losses in terms of the classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, has less variance with respect to the random initialization. Additionally, in contrast to the square loss, squentropy loss can typically be trained using exactly the same optimization parameters, including the learning rate, as the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.
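The two-term definition in the abstract can be sketched in a few lines of NumPy. One assumption is made here that the abstract does not state: the square term is applied to the raw logits of the incorrect classes (with target zero) and averaged over the C - 1 incorrect classes. The function name `squentropy` and its signature are illustrative.

```python
import numpy as np

def squentropy(logits, labels):
    """Cross-entropy plus the average squared logit over incorrect classes.

    logits: array of shape (n, C); labels: integer array of shape (n,).
    Assumption: the square term acts on raw logits, not probabilities.
    """
    n, c = logits.shape
    rows = np.arange(n)

    # Numerically stable log-softmax for the cross-entropy term.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[rows, labels]

    # Mask out the true class, then average squared logits over the rest.
    incorrect = np.ones_like(logits, dtype=bool)
    incorrect[rows, labels] = False
    square_term = (logits ** 2 * incorrect).sum(axis=1) / (c - 1)

    return (cross_entropy + square_term).mean()

# Usage: one sample, three classes, true class 0.
loss = squentropy(np.array([[0.0, 1.0, -1.0]]), np.array([0]))
```

Since the square term has no scaling or weighting factor, the function takes no hyperparameters beyond the logits and labels, in line with the abstract's remark that nothing needs to be adjusted.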
A Deeper Understanding of Deep Learning
Deep learning should not work as well as it seems to: according to traditional statistics and machine learning, any analysis that has too many adjustable parameters will overfit noisy training data, and then fail when faced with novel test data. In clear violation of this principle, modern neural networks often use vastly more parameters than data points, but they nonetheless generalize to new data quite well. The shaky theoretical basis for generalization has been noted for many years. One proposal was that neural networks implicitly perform some sort of regularization--a statistical tool that penalizes the use of extra parameters. Yet efforts to formally characterize such an "implicit bias" toward smoother solutions have failed, said Roi Livni, an advanced lecturer in the department of electrical engineering of Israel's Tel Aviv University.
A New Link to an Old Model Could Crack the Mystery of Deep Learning
In the machine learning world, the sizes of artificial neural networks -- and their outsize successes -- are creating conceptual conundrums. When a network named AlexNet won an annual image recognition competition in 2012, it had about 60 million parameters. These parameters, fine-tuned during training, allowed AlexNet to recognize images that it had never seen before. Two years later, a network named VGG wowed the competition with more than 130 million such parameters. Some artificial neural networks, or ANNs, now have billions of parameters. These massive networks -- astoundingly successful at tasks such as classifying images, recognizing speech and translating text from one language to another -- have begun to dominate machine learning and artificial intelligence.
Belkin's $100 Soundform Connect dongle adds AirPlay 2 to any speaker
Some smart home aficionados still eulogize Google's Chromecast Audio, but Belkin's new Soundform Connect aims to fulfill a similar role -- for iOS users, anyway. The $100 dongle can connect to any traditional home speaker and turn it into an AirPlay 2-compatible smart speaker you can cast audio to from iPhones and iPads running iOS 11.4 and iPadOS 11.4 or newer, plus Macs running Catalina and Apple TVs with tvOS 11.4. And when we say "any" home speaker, we really mean it. The Soundform has at least one nice touch the Chromecast doesn't -- beyond still existing, that is. In addition to the classic 3.5mm jack, there's also a port for standard optical connections -- the Chromecast Audio required audiophiles to own or purchase a TOSLINK-to-3.5mm adapter. According to Belkin, users will be able to ask Siri to play their music or podcasts on the speaker in question, as well as ask the virtual assistant what's playing in each room and remotely control the speaker's volume.
Belkin SoundForm Elite Hi-Fi smart speaker review: The case of the missing midrange
My thoughts about the Belkin SoundForm Elite Hi-Fi Smart Speaker Wireless Charging can be distilled in a single word: boring. Listening to a $300 speaker should be exciting. Belkin doesn't have a track record of building great audio equipment, but its partner on this project--the French audiophile company Devialet--most certainly does. The Devialet Phantom blew my mind when I reviewed it five years ago. So, I had high hopes when I learned Belkin had enlisted that company's expertise to develop something more mainstream.
Tech can help trim your utility bills, which may be on the rise amid coronavirus shutdown
Now that most of the country is hunkered down at home to quell the spread of COVID-19, chances are you're spending more on electricity and other utilities. After all, more lights are on and for a longer period of time. You may be turning up the heat to stay warm (especially for northern states). Appliances – ovens, stoves, dishwashers, and washers and dryers – are getting more use than ever before. And then there's laptops and desktops constantly on for doing work or attending virtual classes at school, or perhaps binging TV shows or playing video games.