Mikhail Belkin
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
Mikhail Belkin, Daniel J. Hsu, Partha Mitra
Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is consistently robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allow for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems.
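To make the second kind of scheme concrete, the following is a minimal sketch of a singularly weighted nearest-neighbor regressor; the weight exponent delta and the coincidence cutoff eps are illustrative assumptions, not the paper's exact construction. Because the weights diverge as the query approaches a training point, the fitted function interpolates the training labels exactly while still averaging over neighbors elsewhere.

    # A minimal sketch (illustrative, not the paper's exact scheme) of a
    # singularly weighted k-nearest-neighbor regressor: weights blow up as the
    # query approaches a training point, so the rule interpolates the labels.
    import numpy as np

    def singular_knn_predict(X_train, y_train, x_query, k=5, delta=2.0, eps=1e-12):
        """Weighted k-NN with singular weights w_i ~ ||x - x_i||^(-delta)."""
        dists = np.linalg.norm(X_train - x_query, axis=1)
        idx = np.argsort(dists)[:k]                  # k nearest neighbors
        d = dists[idx]
        if d[0] < eps:                               # query coincides with a training
            return y_train[idx[0]]                   # point: return its label exactly
        w = d ** (-delta)                            # singular weights
        return np.dot(w, y_train[idx]) / np.sum(w)   # weighted average of labels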
Clustering with Bregman Divergences: an Asymptotic Analysis
Chaoyue Liu, Mikhail Belkin
Clustering, in particular k-means clustering, is a central topic in data analysis. Clustering with Bregman divergences is a recently proposed generalization of k-means clustering that is already widely used in applications. In this paper we analyze theoretical properties of Bregman clustering when the number of clusters k is large. We establish quantization rates and describe the limiting distribution of the centers as k → ∞, extending well-known results for k-means clustering.
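For readers less familiar with the setting, here is a minimal sketch of Lloyd-style hard clustering with a Bregman divergence; the generalized KL divergence is an illustrative choice (nonnegative data are assumed), not something prescribed by the paper. The sketch relies on the standard fact that the arithmetic mean of a cluster minimizes any Bregman divergence to its points, so the update step is the same as in k-means.

    # A minimal sketch of Lloyd-style clustering with a Bregman divergence
    # (generalized KL divergence chosen for illustration; data must be nonnegative).
    import numpy as np

    def kl_divergence(x, c, eps=1e-12):
        """Generalized KL divergence d(x, c) between nonnegative vectors."""
        x = x + eps
        c = c + eps
        return np.sum(x * np.log(x / c) - x + c, axis=-1)

    def bregman_kmeans(X, k, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(n_iter):
            # assignment step: nearest center under the Bregman divergence
            D = np.stack([kl_divergence(X, c) for c in centers], axis=1)
            labels = np.argmin(D, axis=1)
            # update step: the cluster mean minimizes any Bregman divergence
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return centers, labels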
Graphons, mergeons, and so on!
Justin Eldridge, Mikhail Belkin, Yusu Wang
In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the "correct" clustering, give sufficient conditions in which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.
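As a rough illustration of the sampling model assumed here, a graph on n vertices is drawn from a graphon W : [0,1]^2 → [0,1] by sampling latent vertex positions uniformly and connecting each pair independently with probability W(u_i, u_j). The specific graphon W in the sketch below is only an example; a stochastic blockmodel corresponds to a piecewise-constant W.

    # A minimal sketch of sampling a random graph from a graphon W : [0,1]^2 -> [0,1].
    import numpy as np

    def sample_from_graphon(W, n, seed=0):
        rng = np.random.default_rng(seed)
        u = rng.uniform(size=n)                       # latent vertex positions
        A = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(i + 1, n):
                p = W(u[i], u[j])                     # edge probability
                A[i, j] = A[j, i] = rng.random() < p
        return A

    # Example graphon: smooth "assortative" connectivity (not a blockmodel)
    W = lambda x, y: np.exp(-abs(x - y))
    A = sample_from_graphon(W, n=200)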
Diving into the shallows: a computational perspective on large-scale shallow learning
Siyuan Ma, Mikhail Belkin
The remarkable recent success of deep neural networks has not been easy to analyze theoretically. It has been particularly hard to disentangle the relative significance of architecture and optimization in achieving accurate classification on large datasets. On the flip side, shallow methods (such as kernel methods) have encountered obstacles in scaling to large data, despite excellent performance on smaller datasets and extensive theoretical analysis. Practical methods, such as the variants of gradient descent used so successfully in deep learning, seem to perform below par when applied to kernel methods. This difficulty has sometimes been attributed to the limitations of shallow architectures. In this paper we identify a basic limitation of gradient descent-based optimization methods when used in conjunction with smooth kernels.
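To fix the optimization setting the abstract refers to, here is a minimal sketch of plain gradient descent on kernel least-squares regression with a smooth Gaussian kernel; the bandwidth, step size, and step count are illustrative assumptions, and the sketch only illustrates the setup, not the paper's analysis or any proposed remedy.

    # A minimal sketch of gradient descent for kernel least-squares regression
    # with a smooth (Gaussian) kernel; f(x) = sum_i alpha_i k(x, x_i).
    import numpy as np

    def gaussian_kernel(X, Z, bandwidth=1.0):
        # infinitely smooth kernel; the bandwidth is an illustrative choice
        sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
        return np.exp(-sq / (2.0 * bandwidth**2))

    def kernel_gd(X, y, n_steps=1000, bandwidth=1.0):
        """Functional gradient descent on the square loss in coefficient form."""
        K = gaussian_kernel(X, X, bandwidth)
        alpha = np.zeros(len(X))
        eta = 1.0 / np.linalg.eigvalsh(K).max()   # stable step size; 1/n factor absorbed
        for _ in range(n_steps):
            residual = K @ alpha - y              # f(x_i) - y_i at the training points
            alpha -= eta * residual               # gradient step in coefficient space
        return alpha

    # predictions at new points Z: gaussian_kernel(Z, X, bandwidth) @ alpha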