Goto

Collaborating Authors

 initial phase


The Crucial Role of Normalization in Sharpness-Aware Minimization Yan Dai

Neural Information Processing Systems

Sharpness-A ware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks.




61162d94822d468ee6e92803340f2040-Supplemental-Conference.pdf

Neural Information Processing Systems

In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. We consider the setting in which allworkerssample fromthesamedataset, andcommunicate overasparsegraph (decentralized).


Chatbots can sway political opinions but are 'substantially' inaccurate, study finds

The Guardian

The study said tweaking a model after its initial phase of development was an importand factor in making it more persuasive. The study said tweaking a model after its initial phase of development was an importand factor in making it more persuasive. Chatbots can sway political opinions but are'substantially' inaccurate, study finds'Information-dense' AI responses are most persuasive but these tend to be less accurate, says security report Chatbots can sway people's political opinions but the most persuasive artificial intelligence models deliver "substantial" amounts of inaccurate information in the process, according to the UK government's AI security body. Researchers said the study was the largest and most systematic investigation of AI persuasiveness to date, involving nearly 80,000 British participants holding conversations with 19 different AI models. The AI Security Institute carried out the study amid fears that chatbots can be deployed for illegal activities including fraud and grooming.





Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime

arXiv.org Machine Learning

Classical optimization theory requires a small step-size for gradient-based methods to converge. Nevertheless, recent findings challenge the traditional idea by empirically demonstrating Gradient Descent (GD) converges even when the step-size $\eta$ exceeds the threshold of $2/L$, where $L$ is the global smooth constant. This is usually known as the Edge of Stability (EoS) phenomenon. A widely held belief suggests that an objective function with subquadratic growth plays an important role in incurring EoS. In this paper, we provide a more comprehensive answer by considering the task of finding linear interpolator $\beta \in R^{d}$ for regression with loss function $l(\cdot)$, where $\beta$ admits parameterization as $\beta = w^2_{+} - w^2_{-}$. Contrary to the previous work that suggests a subquadratic $l$ is necessary for EoS, our novel finding reveals that EoS occurs even when $l$ is quadratic under proper conditions. This argument is made rigorous by both empirical and theoretical evidence, demonstrating the GD trajectory converges to a linear interpolator in a non-asymptotic way. Moreover, the model under quadratic $l$, also known as a depth-$2$ diagonal linear network, remains largely unexplored under the EoS regime. Our analysis then sheds some new light on the implicit bias of diagonal linear networks when a larger step-size is employed, enriching the understanding of EoS on more practical models.


Dynamics of Concept Learning and Compositional Generalization

arXiv.org Machine Learning

Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence a model's speed of learning the ability to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work's compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the identity mapping on a Gaussian mixture with structurally organized centroids. We mathematically analyze the learning dynamics of neural networks trained on this SIM task and show that, despite its simplicity, SIM's learning dynamics capture and help explain key empirical observations on compositional generalization with diffusion models identified in prior work. Our theory also offers several new insights -- e.g., we find a novel mechanism for non-monotonic learning dynamics of test loss in early phases of training. We validate our new predictions by training a text-conditioned diffusion model, bridging our simplified framework and complex generative models. Overall, this work establishes the SIM task as a meaningful theoretical abstraction of concept learning dynamics in modern generative models.