Understanding the Role of Momentum in Stochastic Gradient Methods
Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao
The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks. Despite these empirical successes, there is a lack of clear understanding of how the momentum parameters affect convergence and various performance measures of different algorithms. In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain practical, sometimes counter-intuitive guidelines for setting the learning rate and momentum parameters.
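For concreteness, the QHM update that serves as the paper's unifying formulation fits in a few lines. The sketch below is a minimal NumPy illustration under that formulation; the learning rate, beta, and nu values are illustrative defaults, not recommendations from the paper. Setting nu = 0 recovers plain SGD, nu = 1 recovers heavy-ball momentum, and nu = beta corresponds to NAG.

```python
import numpy as np

def qhm_step(x, g_buf, grad, lr=0.1, beta=0.9, nu=0.7):
    """One step of quasi-hyperbolic momentum (QHM).

    x     : current iterate
    g_buf : exponential moving average of past gradients
    grad  : stochastic gradient at x
    nu=0 gives plain SGD; nu=1 gives heavy-ball momentum;
    nu=beta corresponds to Nesterov's accelerated gradient.
    """
    g_buf = (1 - beta) * grad + beta * g_buf        # momentum buffer update
    x = x - lr * ((1 - nu) * grad + nu * g_buf)     # QHM step: mix of SGD and momentum
    return x, g_buf

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x, g_buf = np.ones(3), np.zeros(3)
for _ in range(100):
    x, g_buf = qhm_step(x, g_buf, grad=x)
print(x)  # approaches the minimizer at 0
```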
InfoGCL: Information-Aware Graph Contrastive Learning
Various graph contrastive learning models have been proposed in recent years to improve the performance of learning tasks on graph datasets. While effective and prevalent, these models are usually carefully customized. In particular, although all recent methods create two contrastive views, they differ greatly in view augmentations, architectures, and objectives. It remains an open question how to build a graph contrastive learning model from scratch for particular graph learning tasks and datasets. In this work, we aim to fill this gap by studying how graph information is transformed and transferred during the contrastive learning process, and by proposing an information-aware graph contrastive learning framework called InfoGCL. The key point of this framework is to follow the Information Bottleneck principle to reduce the mutual information between contrastive parts while keeping task-relevant information intact, at the level of both the individual module and the entire framework, so that the information loss during graph representation learning is minimized. We show for the first time that all recent graph contrastive learning methods can be unified by our framework. We empirically validate our theoretical analysis on both node and graph classification benchmark datasets, and demonstrate that our algorithm significantly outperforms state-of-the-art methods.
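As background for the "two contrastive views" design the abstract refers to, the sketch below shows a generic two-view InfoNCE-style contrastive objective in PyTorch. It is only the common scaffolding shared by recent methods, not InfoGCL itself: the paper's contribution, the information-aware, Information-Bottleneck-guided choice of augmentations, architectures, and objectives, is not reproduced here.

```python
import torch
import torch.nn.functional as F

def two_view_contrastive_loss(z1, z2, temperature=0.5):
    """InfoNCE-style loss between two augmented views.

    z1, z2: [n, d] embeddings of the same n nodes (or graphs) under
    two different view augmentations. Matching rows are positives;
    all other rows serve as negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature    # pairwise cosine similarities
    labels = torch.arange(z1.size(0))     # positive pair sits on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in embeddings for 8 nodes.
z1, z2 = torch.randn(8, 16), torch.randn(8, 16)
print(two_view_contrastive_loss(z1, z2).item())
```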
Anthropic's newest Claude AI models are experts at programming
Yesterday, in an announcement blog post, AI company Anthropic unveiled Claude 4, its new generation of AI models, consisting of Claude 4 Opus and Claude 4 Sonnet, with a range of new abilities. Both Claude 4 models are hybrid models, meaning they can give short, quick answers or think longer and reason more deeply over their responses. Claude 4 Opus excels at solving complex problems and at programming; the model can maintain its performance on long tasks spanning several hours and thousands of steps. Meanwhile, Anthropic says Claude 4 Sonnet is a major upgrade over Claude 3.7 Sonnet.
Robots square off in world's first humanoid boxing match
After decades of being tortured, shoved, kicked, burned, and bludgeoned, robots are finally getting their chance to fight back. This weekend, Chinese robotics maker Unitree says it will livestream the world's first boxing match between two of its humanoid robots. The event, titled Unitree Iron Fist King: Awakening, will feature a face-off between two of Unitree's 4.3-foot-tall G1 robots. The robots will reportedly be remotely controlled by human engineers, though they are also expected to demonstrate some autonomous, pre-programmed actions.
Appendix
Figure 9: Example showing how a single line of HTML code is rendered by a browser. In this example, block-level tags delimit separate blocks, which are therefore spaced by line breaks, while inline tags are rendered on the same line as the text that precedes and follows them.
Fast, Provably Convergent IRLS Algorithm for p-Norm Linear Regression
Deeksha Adil, Richard Peng, Sushant Sachdeva
Iteratively Reweighted Least Squares (IRLS) is an easy-to-implement family of algorithms for solving p-norm linear regression problems that has been studied for over 50 years. However, these algorithms often diverge for p > 3, and since the work of Osborne (1985), it has been an open problem whether there is an IRLS algorithm that is guaranteed to converge rapidly for p > 3. We propose p-IRLS, the first IRLS algorithm that provably converges geometrically for any p ∈ [2, ∞). Our algorithm is simple to implement and is guaranteed to find a high-accuracy solution in a sub-linear number of iterations. Our experiments demonstrate that it performs even better than our theoretical bounds suggest, beats the standard Matlab/CVX implementation for solving these problems by 10-50x, and is the fastest among available implementations in the high-accuracy regime.
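To make the reweighting idea concrete, here is a minimal NumPy sketch of classical, undamped IRLS for min_x ||Ax - b||_p. It is not the paper's p-IRLS algorithm: as the abstract notes, plain IRLS of this form can diverge for p > 3, and p-IRLS adds the safeguards that yield its geometric convergence guarantee for p ∈ [2, ∞).

```python
import numpy as np

def irls_pnorm(A, b, p=4, iters=50, eps=1e-8):
    """Classical IRLS for min_x ||Ax - b||_p (illustrative sketch only;
    may oscillate or diverge for p > 3 without further safeguards).

    Each iteration solves a weighted least-squares problem with weights
    w_i = |r_i|^(p-2), since ||r||_p^p = sum_i w_i * r_i^2.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]      # start from the 2-norm solution
    for _ in range(iters):
        r = A @ x - b
        w = np.maximum(np.abs(r), eps) ** (p - 2)  # reweighting; eps avoids zero weights
        W = A.T * w                                # A^T diag(w) via broadcasting
        x = np.linalg.solve(W @ A, W @ b)          # weighted normal equations
    return x

# Toy usage on a random overdetermined system.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
x = irls_pnorm(A, b, p=4)
print(np.linalg.norm(A @ x - b, 4))
```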
Microsoft is now testing AI-generated text in Windows Notepad
As of yesterday, Microsoft has begun rolling out a new update to Windows 11 Insiders on the Dev and Canary Channels. The update brings new AI features to Notepad, Paint, and the Snipping Tool. Notepad can now write text from scratch using generative AI, which is meant to help you quickly produce drafts based on your prompts and instructions. To use AI text generation, simply right-click anywhere in the document and select Write. Type in your instructions, then click either Keep Text or Discard on the results.
Interpreting Learned Feedback Patterns in Large Language Models
Luke Marks, Amir Abdullah, Clement Neo
Reinforcement learning from human feedback (RLHF) is widely used to train large language models (LLMs). However, it is unclear whether LLMs accurately learn the underlying preferences in human feedback data. We coin the term Learned Feedback Pattern (LFP) for patterns in an LLM's activations learned during RLHF that improve its performance on the fine-tuning task. We hypothesize that LLMs with LFPs accurately aligned to the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF. To test this, we train probes to estimate the feedback signal implicit in the activations of a fine-tuned LLM. We then compare these estimates to the true feedback, measuring how faithful the LFPs are to the fine-tuning feedback. Our probes are trained on a condensed, sparse, and interpretable representation of LLM activations, making it easier to correlate features of the input with our probes' predictions. We validate our probes by comparing the neural features they associate with positive-feedback inputs against the features GPT-4 describes and classifies as related to LFPs. Understanding LFPs can help minimize discrepancies between LLM behavior and training objectives, which is essential for the safety and alignment of LLMs.
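The probing step can be pictured with a small, hypothetical sketch: fit a linear probe on (activation representation, feedback) pairs, then evaluate how well its estimates track held-out feedback. Everything below is an illustrative stand-in; the synthetic arrays, the Ridge probe, and the 64-dimensional feature size are assumptions, not the paper's condensed sparse representations or actual RLHF feedback.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical stand-ins: `acts` plays the role of condensed activation
# features for 1000 model outputs; `reward` plays the role of the scalar
# feedback each output would have received during RLHF.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))
reward = acts @ rng.normal(size=64) + 0.1 * rng.normal(size=1000)

# Train the probe on the first 800 examples, hold out the rest.
probe = Ridge(alpha=1.0).fit(acts[:800], reward[:800])
pred = probe.predict(acts[800:])

# Agreement between probe estimates and held-out feedback is one way to
# quantify how faithfully the activations encode the fine-tuning signal.
print(np.corrcoef(pred, reward[800:])[0, 1])
```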
Why the argument for WFH could get a big boost from AI
The pandemic changed how people worked, shifting most professionals to remote or hybrid models. For the software company Atlassian, this flexible, distributed approach persists to this day. "We have 13,000 employees spread across the globe, and individuals can choose their working location every day," said Annie Dean, Head of Team Anywhere, Atlassian's distributed work policy. "It's about how we work, not where we work." The implementation of the flexible model has produced positive effects for employees and the company alike. Internal data reveals that even though only 34% of employees have opted to work from home, 92% of Atlassian employees reported that the ability to work from anywhere allows them to perform their best, and 91% said it's an important reason for staying at the company.
Average-Case Averages: Private Algorithms for Smooth Sensitivity and Mean Estimation
The simplest and most widely applied method for guaranteeing differential privacy is to add instance-independent noise to a statistic of interest, scaled to its global sensitivity. However, global sensitivity is a worst-case notion that is often too conservative for realized dataset instances. We provide methods for scaling noise in an instance-dependent way and demonstrate that they provide greater accuracy under average-case distributional assumptions. Specifically, we consider the basic problem of privately estimating the mean of a real distribution from i.i.d. samples.
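For reference, the global-sensitivity baseline that the abstract contrasts with looks like the following minimal sketch: an epsilon-differentially-private mean computed via the Laplace mechanism, with noise scaled to the worst-case sensitivity (hi - lo)/n. The clipping bounds and epsilon below are illustrative assumptions; the paper's instance-dependent (smooth-sensitivity-based) methods are designed to add less noise on typical instances.

```python
import numpy as np

def private_mean_global(x, lo, hi, epsilon):
    """epsilon-DP mean of data clipped to [lo, hi] via the Laplace mechanism.

    With n points in [lo, hi], changing one point moves the mean by at most
    (hi - lo) / n -- the global sensitivity -- so Laplace noise with scale
    (hi - lo) / (n * epsilon) suffices. This is the worst-case baseline.
    """
    x = np.clip(x, lo, hi)
    sensitivity = (hi - lo) / len(x)
    return x.mean() + np.random.laplace(scale=sensitivity / epsilon)

# Toy usage: most of this data sits far from the clipping bounds, so the
# worst-case noise scale is conservative for this particular instance.
data = np.random.default_rng(0).normal(0.0, 0.1, size=1000)
print(private_mean_global(data, lo=-1.0, hi=1.0, epsilon=1.0))
```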