 accuracy improvement


Improving the Straight-Through Estimator with Zeroth-Order Information

Neural Information Processing Systems

We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters by enabling computation of gradients via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, which is a first-order method, recent works have explored the use of zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality biased gradients, and ZO gradients are unbiased but can be expensive. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that reduces STE bias while reducing computations relative to ZO methods. Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy improvement on ResNet 18/50, and 1-22 perplexity point improvement for LLaMA models with up to 0.3 billion parameters. For the same loss, FOGZO yields a 796× reduction in computation versus n-SPSA for a 2-layer MLP on MNIST.
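To make the two gradient sources named in this abstract concrete, here is a minimal sketch of an STE gradient and an n-sample SPSA (zeroth-order) estimate for a toy rounding quantizer. It is not FOGZO itself, whose combination rule the abstract does not spell out; the quantizer, the squared-error loss, and all function names are assumptions chosen for illustration.

```python
import random

def quantize(w):
    """Toy quantizer (assumption): round each weight to the nearest integer."""
    return [float(round(x)) for x in w]

def loss(w, target):
    """Squared error between the quantized weights and a target vector."""
    q = quantize(w)
    return sum((qi - ti) ** 2 for qi, ti in zip(q, target))

def ste_grad(w, target):
    """STE: take the gradient w.r.t. the quantized weights and pass it
    through the rounding step as if rounding were the identity."""
    q = quantize(w)
    return [2.0 * (qi - ti) for qi, ti in zip(q, target)]

def spsa_grad(w, target, eps=1.0, n_samples=64, rng=None):
    """n-sample SPSA: average two-point finite differences along random
    +/-1 directions. Needs only loss evaluations, no back-propagation."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(w)
    for _ in range(n_samples):
        d = [rng.choice((-1.0, 1.0)) for _ in w]
        plus = loss([x + eps * di for x, di in zip(w, d)], target)
        minus = loss([x - eps * di for x, di in zip(w, d)], target)
        scale = (plus - minus) / (2.0 * eps)
        for i, di in enumerate(d):
            # For +/-1 directions, dividing by d_i equals multiplying by d_i.
            grad[i] += scale * di / n_samples
    return grad

w, target = [0.2, 2.7], [1.0, 1.0]
print("STE :", ste_grad(w, target))
print("SPSA:", spsa_grad(w, target))
```

The STE costs one backward pass, while the SPSA estimate here costs two loss evaluations per sample, which illustrates the computational tradeoff the abstract describes.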


Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models

Neural Information Processing Systems

While the recent success of large reasoning models (LRMs) has significantly advanced LLMs' reasoning capability by optimizing the final answer accuracy using reinforcement learning, they may also drastically increase the output length due to overthinking, characterized by unnecessarily complex reasoning paths that waste computation and potentially degrade performance. We hypothesize that such inefficiencies stem from LRMs' limited capability to dynamically select the proper modular reasoning strategies, termed thinking patterns, at the right position. To investigate this hypothesis, we propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifying and promoting beneficial patterns that improve the answer while removing detrimental ones. Empirical analysis confirms that our optimized thinking paths yield more concise yet sufficiently informative trajectories, enhancing reasoning efficiency by reducing attention FLOPs by up to 47% while maintaining accuracy for originally correct responses. Moreover, a non-trivial portion of originally incorrect responses are transformed into correct ones, achieving a 15.6% accuracy improvement with reduced length. Motivated by the improvement brought by the optimized thinking paths, we apply a preference optimization technique supported by a pairwise dataset contrasting suboptimal and optimal reasoning paths. Experimental evaluations across multiple mathematical reasoning benchmarks reveal that our method notably reduces computational overhead while simultaneously improving reasoning accuracy, achieving up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.


Physics-informed machine learning with domain decomposition and global dynamics for three-dimensional intersecting flows

Neural Information Processing Systems

Physics-informed neural networks (PINNs) have emerged as a promising framework to develop complex scientific surrogate models, yet their scalability and accuracy often degrade in non-canonical geometries, such as non-rectangular domains or three-dimensional (3D) domains with high aspect ratios. These limitations hinder the broader adoption of vanilla PINNs in real-world, practical systems. In this work, we introduce a multi-domain PINN (MDPINN) framework designed to address the scalability and generalization challenges inherent in 3D non-rectangular domains governed by nonlinear fluid dynamics. The target domain consists of intersecting 3D fluid channels with a high aspect ratio, inducing complex flow features such as deflections, mixing, and recirculations. Our approach is grounded in two key innovations: 1) domain decomposition, which partitions the channel volumes into multiple cubic-like subdomains, each modeled by an individual PINN, and 2) enforcement of global dynamics (MDPINN-GD), which ensures that the total mass flow rate entering the domain equals that exiting. These innovations reduce the complexity of the problem imposed on individual PINNs and guide effective network optimization toward physically consistent solutions throughout the domain. We demonstrate that our method achieves: 1) a 74.8% accuracy improvement over a single-network PINN, and 2) a 52.9% accuracy improvement over MDPINNs that do not enforce global mass conservation. Furthermore, the MDPINN-GD framework exhibits accurate prediction even in highly complex regions (such as the channel intersecting zone and the outlet zone, characterized by intense flow mixing and large velocity gradients), achieving maximum normalized mean absolute errors below 14.9% for velocity predictions compared to simulation results. This work establishes a path towards a scalable, physically grounded surrogate modeling approach that is extensible to multiphysics and high-dimensional scientific problems.
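The global-dynamics constraint described above (total mass flow in equals total mass flow out) could be expressed as a penalty term along the following lines. This is a hedged sketch only: the abstract does not give the loss formulation, so the patch-sum quadrature, the squared-difference penalty, and the names `mass_flow_rate` and `global_mass_penalty` are all assumptions.

```python
def mass_flow_rate(samples, density=1.0):
    """Approximate the mass flow rate through a surface as the sum of
    density * normal_velocity * patch_area over sampled surface patches.
    `samples` is a list of (normal_velocity, patch_area) pairs."""
    return sum(density * u_n * area for u_n, area in samples)

def global_mass_penalty(inlet_samples, outlet_samples, density=1.0):
    """Squared mismatch between inflow and outflow (hypothetical loss term).
    It is zero exactly when the predicted field conserves mass globally."""
    m_in = mass_flow_rate(inlet_samples, density)
    m_out = mass_flow_rate(outlet_samples, density)
    return (m_in - m_out) ** 2

# Two inlet patches and one outlet patch with predicted normal velocities.
inlet = [(1.0, 0.5), (1.0, 0.5)]   # total inflow  = 1.0
outlet = [(0.5, 1.0)]              # total outflow = 0.5
print(global_mass_penalty(inlet, outlet))
```

In a multi-network setting, a term like this would be added to the per-subdomain PDE and interface losses so that every subdomain network receives a signal tied to the whole domain.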



Neural Priming for Sample-Efficient Adaptation

Matthew Wallingford, Vivek Ramanujan, Alex Fang, Aditya Kusupati

Neural Information Processing Systems

Presented with class names or unlabeled test samples, Neural Priming enables the model to recall and condition its parameters on relevant data seen throughout pretraining, thereby priming it for the test distribution. Neural Priming can be performed at inference, even for pretraining datasets as large as LAION-2B. Performing lightweight updates on the recalled data significantly improves accuracy across a variety of distribution shift and transfer learning benchmarks.



Neural Priming for Sample-Efficient Adaptation

Neural Information Processing Systems

We propose Neural Priming, a technique for adapting large pretrained models to distribution shifts and downstream tasks given few or no labeled examples. Presented with class names or unlabeled test samples, Neural Priming enables the model to recall and condition its parameters on relevant data seen throughout pretraining, thereby priming it for the test distribution. Neural Priming can be performed at test time, even for pretraining datasets as large as LAION-2B. Performing lightweight updates on the recalled data significantly improves accuracy across a variety of distribution shift and transfer learning benchmarks. Concretely, in the zero-shot setting, we see a 2.45% improvement in accuracy on ImageNet and a 3.81% accuracy improvement on average across standard transfer learning benchmarks. Further, using our test-time inference scheme, we see a 1.41% accuracy improvement on ImageNetV2. These results demonstrate the effectiveness of Neural Priming in addressing the common challenge of limited labeled data and changing distributions.


Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks

Neural Information Processing Systems

We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories. These memories can then be recalled to quickly re-train a neural network and recover the performance (instead of storing and re-training on the full original dataset). Building upon the dataset distillation framework, we make a key observation that a shared common representation allows for more efficient and effective distillation. Concretely, we learn a set of bases (aka "memories") which are shared between classes and combined through learned flexible addressing functions to generate a diverse set of training examples. This leads to several benefits: 1) the size of compressed data does not necessarily grow linearly with the number of classes; 2) an overall higher compression rate with more effective distillation is achieved; and 3) more generalized queries are allowed beyond recalling the original classes. We demonstrate state-of-the-art results on the dataset distillation task across five benchmarks, including up to 16.5% and 9.7% accuracy improvement when distilling CIFAR10 and CIFAR100 respectively. We then leverage our framework to perform continual learning, achieving state-of-the-art results on four benchmarks, with 23.2% accuracy improvement on MANY.
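The idea of shared bases combined through addressing functions can be pictured with a small sketch. It assumes the simplest possible addressing, a linear combination with per-example coefficients; the abstract only says the functions are "learned" and "flexible", so the linear form and the names `recall` and `recall_dataset` are illustrative assumptions, not the paper's parameterization.

```python
def recall(bases, address):
    """Synthesize one training example as a weighted sum of shared bases.
    `bases` is a list of equal-length vectors; `address` holds one
    coefficient per basis (assumed linear addressing)."""
    dim = len(bases[0])
    return [sum(a * b[i] for a, b in zip(address, bases)) for i in range(dim)]

def recall_dataset(bases, addresses_by_class):
    """Generate (example, label) pairs from per-class address vectors."""
    return [(recall(bases, addr), label)
            for label, addrs in addresses_by_class.items()
            for addr in addrs]

# Two shared bases serve both classes; only the small address vectors
# are class-specific, so storage need not grow linearly with classes.
bases = [[1.0, 0.0], [0.0, 1.0]]
addresses = {0: [[1.0, 0.0]], 1: [[0.5, 0.5]]}
print(recall_dataset(bases, addresses))
```

Because the bases are shared, adding a class in this sketch adds only a few address coefficients rather than a full set of synthetic examples, which mirrors the first benefit listed in the abstract.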


Predicting Human Chess Moves: An AI Assisted Analysis of Chess Games Using Skill-group Specific n-gram Language Models

arXiv.org Artificial Intelligence

Chess, a deterministic game with perfect information, has long served as a benchmark for studying strategic decision-making and artificial intelligence. Traditional chess engines or tools for analysis primarily focus on calculating optimal moves, often neglecting the variability inherent in human chess playing, particularly across different skill levels. To overcome this limitation, we propose a novel and computationally efficient move prediction framework that approaches chess move prediction as a behavioral analysis task. The framework employs n-gram language models to capture move patterns characteristic of specific player skill levels. By dividing players into seven distinct skill groups, from novice to expert, we trained separate models using data from the open-source chess platform Lichess. The framework dynamically selects the most suitable model for prediction tasks and generates player moves based on preceding sequences. Evaluation on real-world game data demonstrates that the model selector module within the framework can classify skill levels with an accuracy of up to 31.7% when utilizing early game information (16 half-moves). The move prediction framework also shows substantial accuracy improvements, with our Selector Assisted Accuracy being up to 39.1% more accurate than our benchmark accuracy. The computational efficiency of the framework further enhances its suitability for real-time chess analysis.
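A skill-group-specific n-gram pipeline of the kind described here might look like the sketch below: one model per group, a selector that picks the group whose model best explains the early moves, and next-move prediction from the chosen model. The add-one smoothing, the likelihood-based selector, the two toy groups, and all class and function names are assumptions; the abstract does not specify these details.

```python
from collections import Counter, defaultdict
import math

class NGramModel:
    """Move-level n-gram model with add-one smoothing (requires n >= 2)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # context tuple -> next-move counts
        self.vocab = set()

    def _pad(self, moves):
        return ["<s>"] * (self.n - 1) + list(moves)

    def train(self, games):
        for moves in games:
            padded = self._pad(moves)
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.counts[context][padded[i]] += 1
                self.vocab.add(padded[i])

    def prob(self, context, move):
        c = self.counts.get(tuple(context), Counter())
        return (c[move] + 1) / (sum(c.values()) + len(self.vocab) + 1)

    def log_likelihood(self, moves):
        padded = self._pad(moves)
        return sum(math.log(self.prob(padded[i - self.n + 1:i], padded[i]))
                   for i in range(self.n - 1, len(padded)))

    def predict(self, history):
        """Most frequent continuation of the last n-1 moves, or None."""
        context = tuple(self._pad(history)[-(self.n - 1):])
        c = self.counts.get(context)
        return c.most_common(1)[0][0] if c else None

def select_model(models, early_moves):
    """Pick the skill group whose model assigns the early moves the
    highest log-likelihood (hypothetical selector)."""
    return max(models, key=lambda name: models[name].log_likelihood(early_moves))

# Toy data standing in for per-group Lichess games.
models = {"novice": NGramModel(), "expert": NGramModel()}
models["novice"].train([["e4", "e5", "Qh5"]] * 3)
models["expert"].train([["e4", "c5", "Nf3"]] * 3)

group = select_model(models, ["e4", "c5"])
print(group, models[group].predict(["e4", "c5"]))
```

Training and inference reduce to counting and table lookups, which is consistent with the computational efficiency the abstract emphasizes.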


Slimmed Asymmetrical Contrastive Learning and Cross Distillation for Lightweight Model Training

Jian Meng, Li Yang

Neural Information Processing Systems

Contrastive learning (CL) has been widely investigated with various learning mechanisms and achieves strong capability in learning data representations in a self-supervised manner from unlabeled data. A common approach in this line of work employs large encoders to achieve performance comparable to the supervised learning counterpart. Despite the success of label-free training, current contrastive learning algorithms fail to achieve good performance with lightweight (compact) models, e.g., MobileNet, while the reliance on heavy encoders impedes energy-efficient computation, especially for resource-constrained AI applications. Motivated by this, we propose a new self-supervised CL scheme, named SACL-XD, consisting of two technical components, Slimmed Asymmetrical Contrastive Learning (SACL) and Cross-Distillation (XD), which collectively enable efficient CL with compact models.