
Author response for "Fixing the train-test resolution discrepancy"

Neural Information Processing Systems

We thank the reviewers for their constructive feedback on the paper. Here we answer their main questions and comments, in particular whether the reported results are significant. We have evaluated our approach in transfer learning for low-resource and/or fine-grained classification. Then (3) we apply our method, i.e. we fine-tune the last layers at the test resolution. Finally, we applied our method to a very large ResNeXt-101 32x48d from [Mahajan et al.]


Fixing the train-test resolution discrepancy

Neural Information Processing Systems

Data-augmentation is key to the training of neural networks for image classification. This paper first shows that existing augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: in fact, a lower train resolution improves the classification at test time! We then propose a simple strategy to optimize classifier performance that employs different train and test resolutions. It relies on a computationally cheap fine-tuning of the network at the test resolution. This enables training strong classifiers using small training images, and therefore significantly reduces the training time. For instance, we obtain 77.1% top-1 accuracy on ImageNet with a ResNet-50 trained on 128x128 images, and 79.8% with one trained at 224x224. A ResNeXt-101 32x48d pre-trained with weak supervision on 940 million 224x224 images and further optimized with our technique for test resolution 320x320 achieves 86.4% top-1 accuracy (top-5: 98.0%). To the best of our knowledge this is the highest ImageNet single-crop accuracy to date.
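The fine-tuning step is cheap because only the last layers need to adapt to the new resolution. A minimal PyTorch sketch of the idea (a toy network and hyperparameters of our own, not the paper's exact recipe): global average pooling lets the same network accept both resolutions, so after low-resolution training one can freeze the feature extractor and fine-tune just the classifier on test-resolution batches.

```python
import torch
import torch.nn as nn

# Toy stand-in for a CNN trained at low resolution: adaptive average
# pooling makes the network resolution-agnostic at the feature level.
class TinyNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # handles any input resolution
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        h = self.pool(self.features(x)).flatten(1)
        return self.classifier(h)

model = TinyNet()

# Step 1 (pretend): the network was trained on low-resolution crops, e.g. 128x128.
# Step 2: freeze the backbone and fine-tune only the classifier at the
# *test* resolution, e.g. 224x224.
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

opt = torch.optim.SGD(model.classifier.parameters(), lr=1e-3)
x = torch.randn(4, 3, 224, 224)          # test-resolution batch
y = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
```

The point of the sketch is the split: the frozen backbone already produces useful features, and only the final layer is re-estimated at the resolution the classifier will actually see at test time.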


Fixing the NTK: From Neural Network Linearizations to Exact Convex Programs

Neural Information Processing Systems

Recently, theoretical analyses of deep neural networks have broadly focused on two directions: 1) providing insight into neural network training by SGD in the limit of infinite hidden-layer width and infinitesimally small learning rate (also known as gradient flow) via the Neural Tangent Kernel (NTK), and 2) globally optimizing the regularized training objective via cone-constrained convex reformulations of ReLU networks. The latter research direction also yielded an alternative formulation of the ReLU network, called a gated ReLU network, that is globally optimizable via efficient unconstrained convex programs. In this work, we interpret the convex program for this gated ReLU network as a Multiple Kernel Learning (MKL) model with a weighted data-masking feature map and establish a connection to the NTK. Specifically, we show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data. A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set. By using iterative reweighting, we improve the weights induced by the NTK to obtain the optimal MKL kernel, which is equivalent to the solution of the exact convex reformulation of the gated ReLU network. We also provide several numerical simulations corroborating our theory. Additionally, we provide an analysis of the prediction error of the resulting optimal kernel via consistency results for the group lasso.
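The masked-feature-map view of the MKL kernel can be made concrete with a toy NumPy sketch. Here the gate directions g_i are random and the mask weights w_i are uniform (both our illustrative choices, not the paper's NTK-induced weights): each mask D_i = diag(1[X g_i > 0]) defines a feature map D_i X, and the MKL kernel is the weighted sum K = Σ_i w_i (D_i X)(D_i X)^T.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 3, 5                  # samples, features, number of gates/masks
X = rng.standard_normal((n, d))
G = rng.standard_normal((d, m))    # random gate directions (fixed, not learned)

# Column i of `masks` is the diagonal of D_i = diag(1[X g_i > 0]).
masks = (X @ G > 0).astype(float)  # shape (n, m)

# MKL view: K = sum_i w_i (D_i X)(D_i X)^T, here with uniform weights.
w = np.full(m, 1.0 / m)
K = sum(w[i] * (masks[:, [i]] * X) @ (masks[:, [i]] * X).T for i in range(m))
```

Since each summand is a Gram matrix scaled by a nonnegative weight, K is symmetric positive semidefinite, as a valid kernel must be; choosing the weights (uniformly here, by iterative reweighting in the paper) is exactly the MKL problem.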


New study reveals threats to the Class of 2025. Fixing them should be Job No. 1 for America

FOX News

FOX Business' Taylor Riggs joins 'Fox & Friends' to discuss her take on the June jobs report, Democrats' attacks against the legislation and why they claim it will target Medicaid. This summer should be bringing the Class of 2025 a moment of well-deserved relaxation before they launch their careers. Instead, far too many college and high-school graduates are filled with anxiety. They've applied for dozens, perhaps hundreds, of jobs, but interviews and offers have become increasingly rare. The national unemployment rate for young adults aged 20 to 24 looking for work is 6.6% -- the highest level in a decade, excluding the pandemic unemployment spike.


Fixing the Double Penalty in Data-Driven Weather Forecasting Through a Modified Spherical Harmonic Loss Function

Subich, Christopher, Husain, Syed Zahid, Separovic, Leo, Yang, Jing

arXiv.org Artificial Intelligence

Beginning in 2023, the release of data-driven atmospheric forecasting models powered by deep neural network architectures began a revolution in medium-range weather forecasting, with some commentators [Bauer, 2024] anticipating that data-driven forecasting will soon supplant traditional numerical weather prediction (NWP) systems in all operational contexts. GraphCast [Lam et al., 2023], FourCastNet [Kurth et al., 2023], and Pangu-Weather [Bi et al., 2023] demonstrated forecast skill superior to that of the high-resolution forecast system (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF) at lead times (forecast lengths) up to 10 days. Since the publication of these models, the field has been joined by many others, including the Artificial Intelligence Forecasting System (AIFS) developed by ECMWF itself [Lang et al., 2024a]. From the standpoint of machine learning, atmospheric forecasting is a large-scale generative problem comparable to predicting the next frame of a video. As a typical example, the version of the GraphCast model deployed experimentally by the National Oceanic and Atmospheric Administration (NOAA) [NOAA, 2024] predicts the 6-hour forecast for six atmospheric variables at each of 13 vertical levels plus five surface variables, on a latitude/longitude grid, for about 86 million output degrees of freedom in aggregate. GraphCast takes two time-levels as input, so the input for this model has about 170 million degrees of freedom. These first-generation data-driven weather models generally act as deterministic forecast systems, where each unique initial condition is mapped to a single forecast and verified against a "ground truth" from a data analysis system. The ERA5 atmospheric reanalysis [Hersbach et al., 2020] of ECMWF is most often used as the source of initial and verifying data for these forecast systems owing to its high quality and consistent behaviour from 1979 to present.
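The degrees-of-freedom figures quoted above are easy to sanity-check with back-of-the-envelope arithmetic, assuming GraphCast's 0.25-degree latitude/longitude grid (721 x 1440 points):

```python
# Degrees-of-freedom check for the GraphCast configuration described above.
lat, lon = 721, 1440                 # 0.25-degree global grid
grid_points = lat * lon              # 1,038,240 grid points

variables = 6 * 13 + 5               # 6 upper-air vars x 13 levels + 5 surface vars = 83
output_dof = variables * grid_points # ~86 million outputs per forecast step
input_dof = 2 * output_dof           # two input time levels -> ~172 million inputs

print(output_dof, input_dof)
```

This reproduces the quoted figures: roughly 86 million output and about 170 million input degrees of freedom.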


Reviews: Fixing the train-test resolution discrepancy

Neural Information Processing Systems

Clarity: The paper is clearly written and easy to follow. Significance: The results in the paper are significant for practitioners and existing deployments, as they shed light on the train-test resolution discrepancy and suggest a method to improve test performance for existing trained models. Novelty: The analysis in this paper is novel (though improved performance on higher resolution images has been observed earlier). Questions: While the focus is on fixing the discrepancy after the model has been initially trained, why not just fix the training such that there is no discrepancy, as opposed to changing the size for test and fine-tuning? Lines 110-111 derive f ∝ sqrt(HW), which does not seem to be right since k doesn't include the sensor size.


Reviews: Fixing the train-test resolution discrepancy

Neural Information Processing Systems

I think the paper addresses an interesting problem, albeit limited in scope to computer vision. I am sure practitioners in that field will appreciate the paper's findings. Two of the reviewers were positive, and reaffirmed their position during the post-rebuttal discussion, while R1 remained concerned, in particular regarding lack of rigorous statistical analysis of the results. The other reviewers did not consider that issue a deal-breaker, and I agree and recommend to accept.


Fixing the Loose Brake: Exponential-Tailed Stopping Time in Best Arm Identification

Balagopalan, Kapilan, Nguyen, Tuan Ngo, Zhao, Yao, Jun, Kwang-Sung

arXiv.org Machine Learning

The best arm identification problem requires identifying the best alternative (i.e., arm) in active experimentation using the smallest number of experiments (i.e., arm pulls), which is crucial for cost-efficient and timely decision-making processes. In the fixed confidence setting, an algorithm must stop data-dependently and return the estimated best arm with a correctness guarantee. Since this stopping time is random, we desire its distribution to have light tails. Unfortunately, many existing studies focus on high probability or in expectation bounds on the stopping time, which allow heavy tails and, for high probability bounds, even not stopping at all. We first prove that this never-stopping event can indeed happen for some popular algorithms. Motivated by this, we propose algorithms that provably enjoy an exponential-tailed stopping time, which improves upon the polynomial tail bound reported by Kalyanakrishnan et al. (2012). The first algorithm is based on a fixed budget algorithm called Sequential Halving along with a doubling trick. The second algorithm is a meta algorithm that takes in any fixed confidence algorithm with a high probability stopping guarantee and turns it into one that enjoys an exponential-tailed stopping time. Our results imply that there is much more to be desired for contemporary fixed confidence algorithms.
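The Sequential Halving subroutine that the first algorithm builds on can be sketched compactly. The Bernoulli reward model, budget split, and arm means below are illustrative assumptions for the sketch, not the paper's setting: the budget is divided over ceil(log2 K) rounds, every surviving arm is pulled equally often, and the worse half is eliminated after each round.

```python
import math
import random

def sequential_halving(means, budget, seed=0):
    """Fixed-budget best-arm identification by Sequential Halving:
    split the budget over ceil(log2(K)) rounds, pull every surviving
    arm equally often, and keep the better half after each round."""
    rng = random.Random(seed)
    arms = list(range(len(means)))
    rounds = math.ceil(math.log2(len(arms)))
    for _ in range(rounds):
        if len(arms) == 1:
            break
        pulls = max(1, budget // (len(arms) * rounds))
        # Empirical mean of Bernoulli(means[a]) rewards for each surviving arm.
        est = {a: sum(rng.random() < means[a] for _ in range(pulls)) / pulls
               for a in arms}
        arms = sorted(arms, key=est.get, reverse=True)[:max(1, len(arms) // 2)]
    return arms[0]

best = sequential_halving([0.1, 0.2, 0.9, 0.3], budget=4000)
```

Because the total number of pulls is fixed in advance, the stopping time of this subroutine is deterministic; the paper's first algorithm combines it with a doubling trick to obtain a fixed-confidence procedure whose (random) stopping time has exponential tails.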

