
SGD on Neural Networks Learns Functions of Increasing Complexity

Neural Information Processing Systems

We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks. We show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier. More generally, we give evidence for the hypothesis that, as iterations progress, SGD learns functions of increasing complexity. This hypothesis can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the overparameterized regime. We also show that the linear classifier learned in the initial stages is "retained" throughout the execution even if training is continued to the point of zero training error, and complement this with a theoretical result in a simplified model. Key to our work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information.
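The conditional-mutual-information measure can be made concrete with a small sketch. The snippet below is a minimal illustration (not the authors' code, and the exact definition in the paper may differ): it estimates I(F; Y | L), i.e. how much extra information a network's predictions F carry about the labels Y once a simpler classifier's predictions L are known; a small value means L explains most of F's performance.

```python
# Minimal sketch (assumed formulation): estimate I(F; Y | L) in nats from
# discrete prediction and label arrays, using empirical distributions.
import numpy as np

def conditional_mutual_information(f, y, l):
    f, y, l = np.asarray(f), np.asarray(y), np.asarray(l)
    cmi = 0.0
    for lv in np.unique(l):
        mask = l == lv
        p_l = mask.mean()
        fs, ys = f[mask], y[mask]
        fv, yv = np.unique(fs), np.unique(ys)
        # empirical joint distribution of (F, Y) conditioned on L = lv
        joint = np.zeros((len(fv), len(yv)))
        for i, a in enumerate(fv):
            for j, b in enumerate(yv):
                joint[i, j] = np.mean((fs == a) & (ys == b))
        pf = joint.sum(axis=1, keepdims=True)   # conditional marginal of F
        py = joint.sum(axis=0, keepdims=True)   # conditional marginal of Y
        nz = joint > 0
        cmi += p_l * np.sum(joint[nz] * np.log(joint[nz] / (pf @ py)[nz]))
    return cmi

# Toy check: if F equals Y and L is uninformative, I(F; Y | L) = H(Y) (~0.693 nats here).
y = np.array([0, 1, 0, 1, 1, 0])
print(conditional_mutual_information(f=y, y=y, l=np.zeros_like(y)))
```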


formalization of "how well a linear classifier explains the performance of a model" has many advantages over prior

Neural Information Processing Systems

We thank all the reviewers for their insightful comments and suggestions.

Reviewer 1: We consider our mutual information framework to be a core contribution of our paper. The network does not "forget" the simple component even when trained to completion, provided it somehow learns the simple component.

Reviewer 2: Thank you for pointing out the relevant papers. We plan on including a separate section with such examples in the final version. Concretely, regarding "On the spectral bias of neural networks" [1]: they consider measuring "simplicity" via the Fourier spectrum of the learned function. Our metrics do not suffer from this issue: they are taken with respect to the true data distribution.


InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint

Neural Information Processing Systems

Text-conditioned motion synthesis has made remarkable progress with the emergence of diffusion models. However, the majority of these motion diffusion models are designed for a single character and overlook multi-human interactions. We explore this problem by synthesizing human motion with interactions for a group of characters of any size in a zero-shot manner. The key aspect of our approach is to represent human-wise interactions as pairs of human joints that can be either in contact or separated by a desired distance. In contrast to existing methods that require training motion generation models on multi-human motion datasets with a fixed number of characters, our approach inherently possesses the flexibility to model human interactions involving an arbitrary number of individuals, thereby transcending the limitations imposed by the training data. We introduce a novel controllable motion generation method, InterControl, which encourages the synthesized motions to maintain the desired distances between joint pairs. It consists of a motion controller and an inverse kinematics guidance module that realistically and accurately align the joints of synthesized characters to the desired locations. Furthermore, we demonstrate that the joint-pair distances specifying human-wise interactions can be generated by an off-the-shelf Large Language Model (LLM). Experimental results highlight the capability of our framework to generate interactions among multiple human characters and its potential to work with off-the-shelf physics-based character simulators.
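As a rough illustration of the joint-pair constraints, the following sketch (an assumed formulation, not the released InterControl code) scores how far a set of synthesized joints is from the desired contact or separation distances; the gradient of such a loss could serve as guidance during diffusion sampling.

```python
# Sketch of a joint-pair distance loss over synthesized motion; each constraint
# asks two joints (possibly on different characters) to sit at a target distance
# at a given frame. All shapes and names are illustrative assumptions.
import torch

def joint_pair_distance_loss(joints, constraints):
    """joints: (num_people, num_frames, num_joints, 3) predicted joint positions.
    constraints: list of (person_a, joint_a, person_b, joint_b, frame, target_dist)."""
    loss = joints.new_zeros(())
    for pa, ja, pb, jb, t, d in constraints:
        dist = torch.linalg.norm(joints[pa, t, ja] - joints[pb, t, jb])
        loss = loss + (dist - d).abs()   # drive the pair toward the desired distance
    return loss

# During sampling, the gradient of this loss w.r.t. the denoised motion could be
# applied as classifier-style guidance, e.g. x0_pred = x0_pred - step * grad.
```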



TAIA: Large Language Models are Out-of-Distribution Data Learners

Neural Information Processing Systems

Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that, within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA). We empirically validate TAIA using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding, across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that TAIA achieves significant improvements over both the fully fine-tuned model and the base model in most scenarios. Its high tolerance to data mismatch makes TAIA resistant to jailbreaking tuning and enables it to enhance specialized tasks using general data. Code is available at https://github.com/pixas/TAIA_LLM.
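A minimal sketch of the TAIA recipe as the abstract states it: fine-tune everything, then at inference keep only the fine-tuned attention weights and fall back to the base model elsewhere. The parameter-name matching below is an assumption for illustration; actual module names depend on the architecture and framework.

```python
# Sketch (not the authors' code): merge two state dicts so that attention
# parameters come from the fine-tuned model and everything else from the base.
def build_taia_state_dict(base_state, finetuned_state, attn_keyword="attn"):
    merged = {}
    for name, base_param in base_state.items():
        if attn_keyword in name:             # keep the fine-tuned attention update
            merged[name] = finetuned_state[name]
        else:                                # revert FFN and other params to the base model
            merged[name] = base_param
    return merged

# Hypothetical usage:
# model.load_state_dict(build_taia_state_dict(base.state_dict(), tuned.state_dict()))
```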


Coarse-to-Fine Concept Bottleneck Models

Neural Information Processing Systems

Deep learning algorithms have recently gained significant attention due to their impressive performance. However, their high complexity and uninterpretable mode of operation hinder their confident deployment in real-world safety-critical tasks.


b33128cb0089003ddfb5199e1b679652-AuthorFeedback.pdf

Neural Information Processing Systems

Response to Reviewer 1: Thank you for your detailed review. First, our results are not subsumed by [1] (we use your reference numbering). Our responses to your technical comments use your enumeration. We have not responded to points we do not dispute. Define A(R, λ, θ) as the set on the right-hand side of (7).


Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss

Neural Information Processing Systems

Nevertheless, we show that under certain minimal and realistic distributional settings, it is possible to obtain a (1+ε)-approximation with a nearly linear running time and poly(k/ε) + O(k log n) columns. Namely, we show that if the input matrix A has the form A = B + E, where B is an arbitrary rank-k matrix, and E is a matrix with i.i.d.
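To make the objective concrete, the sketch below (an assumed toy setup, not the paper's algorithm) builds A = B + E with a rank-k B plus heavy-tailed i.i.d. noise, picks a column subset, and measures the entrywise l1 error of fitting every column of A in the span of the selected columns, approximating each l1 regression with a few IRLS steps.

```python
# Sketch: entrywise l1 column-subset-selection error under the A = B + E model.
import numpy as np

def l1_subset_error(A, S, iters=20, eps=1e-8):
    """l1 error of fitting each column of A in the span of A[:, S]; each l1
    regression is approximated by iteratively reweighted least squares."""
    C = A[:, S]
    total = 0.0
    for j in range(A.shape[1]):
        b = A[:, j]
        w = np.ones_like(b)
        for _ in range(iters):
            x, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * C, np.sqrt(w) * b, rcond=None)
            w = 1.0 / np.maximum(np.abs(C @ x - b), eps)
        total += np.abs(C @ x - b).sum()
    return total

# Toy instance: rank-k B plus i.i.d. heavy-tailed noise, random subset of size O(k).
rng = np.random.default_rng(0)
n, k = 120, 5
B = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
A = B + 0.1 * rng.standard_cauchy((n, n))
S = rng.choice(n, size=3 * k, replace=False)
print(l1_subset_error(A, S))
```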


Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages

Neural Information Processing Systems

Recent advances in large language models (LLMs) for code applications have demonstrated remarkable zero-shot fluency and instruction following on challenging code-related tasks ranging from test case generation to self-repair. Unsurprisingly, however, models struggle to compose syntactically valid programs in programming languages unrepresented in pre-training, referred to as very low-resource programming languages (VLPLs). VLPLs appear in crucial settings, including domain-specific languages for internal tools, tool-chains for legacy languages, and formal verification frameworks. Inspired by a technique called natural programming elicitation, we propose designing an intermediate language that LLMs "naturally" know how to use and which can be automatically compiled to a target VLPL. When LLMs generate code that lies outside this intermediate language, we use compiler techniques to repair the code into programs in the intermediate language. Overall, we introduce synthetic programming elicitation and compilation (SPEAC), an approach that enables LLMs to generate syntactically valid code even for VLPLs. We empirically evaluate the performance of SPEAC in a case study on the UCLID5 formal verification language and find that, compared to existing retrieval and fine-tuning baselines, SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
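The generate-check-repair-compile loop can be illustrated with a deliberately toy sketch: here the "intermediate language" is just Python expressions, the repair rule only balances parentheses, and the "VLPL" output is a made-up assertion string. None of these stand-ins are SPEAC components; only the control flow mirrors the description above.

```python
# Toy sketch of a generate -> check -> repair -> compile pipeline (assumed, not SPEAC).
import ast

def parse_intermediate(src):
    try:
        ast.parse(src, mode="eval")               # toy intermediate language: Python expressions
        return True
    except SyntaxError:
        return False

def repair(src):
    return src + ")" * (src.count("(") - src.count(")"))   # toy repair: balance parentheses

def compile_to_vlpl(src):
    return f"(assert (= result {src}))"            # toy "VLPL" target: an assertion string

def synthesize(llm_generate, prompt, max_repairs=3):
    candidate = llm_generate(prompt)               # llm_generate is any callable producing a candidate
    for _ in range(max_repairs):
        if parse_intermediate(candidate):
            return compile_to_vlpl(candidate)
        candidate = repair(candidate)
    raise ValueError("could not repair candidate into the intermediate language")

# Usage with a stub generator that emits an unbalanced expression:
print(synthesize(lambda p: "(1 + (2 * 3", "sum and product"))
```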


Mutual Information Estimation via f-Divergence and Data Derangements

Neural Information Processing Systems

Estimating mutual information accurately is pivotal across diverse applications, from machine learning to communications and biology, enabling us to gain insights into the inner mechanisms of complex systems. Yet, dealing with high-dimensional data presents a formidable challenge, due to its size and the presence of intricate relationships. Recently proposed neural methods employing variational lower bounds on the mutual information have gained prominence. However, these approaches suffer from either high bias or high variance, as the sample size and the structure of the loss function directly influence the training process. In this paper, we propose a novel class of discriminative mutual information estimators based on the variational representation of the f-divergence. We investigate the impact of the permutation function used to obtain the marginal training samples and present a novel architectural solution based on derangements. The proposed estimator is flexible and exhibits an excellent bias/variance trade-off. The comparison with state-of-the-art neural estimators, through extensive experimentation within established reference scenarios, shows that our approach offers higher accuracy and lower complexity.
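The derangement idea can be sketched in a few lines: to build "marginal" pairs for a discriminative estimator, permute the y samples with a permutation that has no fixed points, so no joint pair survives the shuffle. The rejection-sampling derangement below is an assumed choice, not necessarily the scheme used in the paper.

```python
# Sketch: derangement-based negative pairs for a discriminative MI estimator.
import numpy as np

def random_derangement(n, rng):
    while True:                                    # rejection sampling: retry until no fixed point
        perm = rng.permutation(n)
        if not np.any(perm == np.arange(n)):
            return perm

def make_training_pairs(x, y, rng):
    """Return (joint x, joint y, deranged y); (x, y[perm]) approximates the product of marginals."""
    perm = random_derangement(len(x), rng)
    return x, y, y[perm]

# A discriminator trained to separate (x, y) from (x, y_deranged) yields a
# variational f-divergence estimate of I(X; Y).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 2))
y = x + 0.1 * rng.standard_normal((8, 2))
print(make_training_pairs(x, y, rng)[2].shape)
```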