swan
SBSC: Step-By-Step Coding for Improving Mathematical Olympiad Performance
Singh, Kunal, Biswas, Ankan, Bhowmick, Sayandeep, Moturi, Pradeep, Gollapalli, Siva Kishore
We propose Step-by-Step Coding (SBSC): a multi-turn math reasoning framework that enables Large Language Models (LLMs) to generate a sequence of programs for solving Olympiad-level math problems. At each step/turn, by leveraging the code execution outputs and programs of previous steps, the model generates the next sub-task and the corresponding program to solve it. This way, SBSC sequentially navigates to reach the final answer. SBSC allows a more granular, flexible and precise approach to problem-solving compared to existing methods. Extensive experiments highlight the effectiveness of SBSC in tackling competition and Olympiad-level math problems. For Claude-3.5-Sonnet, we observe that SBSC (greedy decoding) surpasses existing state-of-the-art (SOTA) program-generation-based reasoning strategies by an absolute 10.7% on AMC12, 8% on AIME and 12.6% on MathOdyssey. Given that SBSC is multi-turn in nature, we also benchmark SBSC's greedy decoding against self-consistency decoding results of existing SOTA math reasoning strategies and observe absolute performance gains of 6.2% on AMC, 6.7% on AIME and 7.4% on MathOdyssey.
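The multi-turn loop described in this abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `scripted_model` is a hypothetical stand-in for the LLM, `run_program` and `solve` are names invented here, and the divisor problem is made up for the example. The only part taken from the abstract is the control flow, where each turn sees the programs and execution outputs of earlier turns and emits the next program.

```python
import contextlib
import io


def run_program(program: str) -> str:
    """Execute one step's program and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Fresh namespace per step. Acceptable for a toy; real systems need a sandbox.
        exec(program, {})
    return buffer.getvalue().strip()


def scripted_model(history):
    """Hypothetical stand-in for the LLM.

    `history` holds (program, output) pairs from earlier turns.
    Returns (next_program, is_final).
    Toy problem: sum the divisors of 360 that are perfect squares.
    """
    if not history:
        # Turn 1 sub-task: enumerate the divisors of 360.
        return "print([d for d in range(1, 361) if 360 % d == 0])", False
    # Turn 2 sub-task: reuse the previous execution output inside the new program.
    divisors_text = history[-1][1]
    program = (
        "import math\n"
        f"divisors = {divisors_text}\n"
        "print(sum(d for d in divisors if math.isqrt(d) ** 2 == d))"
    )
    return program, True


def solve(model, max_turns=5):
    """Run the generate-execute loop until the model marks a step as final."""
    history = []
    for _ in range(max_turns):
        program, is_final = model(history)
        output = run_program(program)
        history.append((program, output))
        if is_final:
            return output, history
    raise RuntimeError("no final answer within the turn budget")


answer, history = solve(scripted_model)
print(answer)  # prints 50 (square divisors of 360 are 1, 4, 9, 36)
```

In a real setting the scripted function would be replaced by a prompted model call, and the history would be serialized into the prompt for the next turn.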
Gradient Multi-Normalization for Stateless and Scalable LLM Training
Scetbon, Meyer, Ma, Chao, Gong, Wenbo, Meeds, Edward
Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015), which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024), address this by eliminating the need for optimizer states while achieving performance comparable to Adam via a multi-step preprocessing procedure applied to instantaneous gradients. Motivated by the success of SWAN, we introduce a novel framework for designing stateless optimizers that normalizes stochastic gradients according to multiple norms. To achieve this, we propose a simple alternating scheme to enforce the normalization of gradients w.r.t. these norms. We show that our procedure can produce, up to an arbitrary precision, a fixed-point of the problem, and that SWAN is a particular instance of our approach with carefully chosen norms, providing a deeper understanding of its design. However, SWAN's computationally expensive whitening/orthogonalization step limits its practicality for large LMs. Using our principled perspective, we develop a more efficient, scalable, and practical stateless optimizer. Our algorithm relaxes the properties of SWAN, significantly reducing its computational cost while retaining its memory efficiency, making it applicable to training large-scale models. Experiments on pre-training LLaMA models with up to 1 billion parameters demonstrate a 3X speedup over Adam with significantly reduced memory requirements, outperforming other memory-efficient baselines.
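One way to picture the alternating scheme mentioned in this abstract is to normalize a gradient matrix by rows, then by columns, and repeat until both constraints hold at once. The sketch below does that. The choice of row-wise and column-wise Euclidean norms, the targets sqrt(n) and sqrt(m), and the function names are all assumptions made for illustration; the abstract does not specify which norms the authors use.

```python
import math


def transpose(g):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*g)]


def normalize_rows(g, target):
    """Rescale every row to Euclidean norm `target` (assumes no all-zero row)."""
    out = []
    for row in g:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x * target / norm for x in row])
    return out


def normalize_cols(g, target):
    """Rescale every column to Euclidean norm `target` (assumes no all-zero column)."""
    return transpose(normalize_rows(transpose(g), target))


def multi_normalize(g, iters=50):
    """Alternate row and column normalization of an m x n matrix.

    Targets sqrt(n) for rows and sqrt(m) for columns are chosen so both
    constraints imply the same squared Frobenius norm, m * n.
    """
    m, n = len(g), len(g[0])
    for _ in range(iters):
        g = normalize_rows(g, math.sqrt(n))
        g = normalize_cols(g, math.sqrt(m))
    return g


grad = [[1.0, -2.0, 3.0], [4.0, 5.0, -6.0]]  # made-up gradient matrix
out = multi_normalize(grad)
print([math.sqrt(sum(x * x for x in row)) for row in out])       # row norms
print([math.sqrt(sum(x * x for x in col)) for col in zip(*out)])  # column norms
```

No state is carried between calls: the output depends only on the current gradient, which is what makes such a preprocessing step stateless.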
SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
Ma, Chao, Gong, Wenbo, Scetbon, Meyer, Meeds, Edward
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving $\approx 50\%$ reduction in total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training LLaMA models with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity using half as many tokens.
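The two pre-processing operations named in this abstract can be sketched on a small matrix. The sketch assumes that normalization means standardizing each row to zero mean and unit variance, and that whitening means mapping G to (G G^T)^{-1/2} G, approximated here with the standard Newton-Schulz iteration. Both readings, along with the function names and the learning rate, are assumptions for illustration rather than the paper's exact procedure.

```python
import math


def transpose(a):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize(g, eps=1e-8):
    """Standardize each row to zero mean and unit variance (assumed form)."""
    out = []
    for row in g:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def whiten(g, iters=30):
    """Approximate (G G^T)^{-1/2} G via Newton-Schulz; assumes G has full row rank."""
    fro = math.sqrt(sum(x * x for row in g for x in row))
    x = [[v / fro for v in row] for row in g]  # scale so all singular values are <= 1
    for _ in range(iters):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def swan_like_step(w, grad, lr=0.1):
    """One stateless update: only the current weights and gradient are used."""
    update = whiten(normalize(grad))
    return [[wi - lr * ui for wi, ui in zip(wr, ur)] for wr, ur in zip(w, update)]


grad = [[1.0, 2.0, 4.0], [3.0, -1.0, 2.0]]  # made-up 2 x 3 gradient
update = whiten(normalize(grad))
print(matmul(update, transpose(update)))  # close to the 2 x 2 identity
new_w = swan_like_step([[0.0] * 3, [0.0] * 3], grad)
```

After whitening, the rows of the update are approximately orthonormal, so the update has no dominant direction; nothing is stored between steps, in contrast to Adam's moment estimates.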
Firing Pat Gelsinger doesn't solve Intel's problems
Despite Intel's recent woes, I didn't expect to see CEO Pat Gelsinger joining 15,000 or so of his colleagues being shown the door. Gelsinger is a storied engineer and business success who laid down an exhaustive rescue plan when he took the helm of the beleaguered chipmaker in 2021. It was never going to be a quick fix, given the company's long legacy of missteps. Gelsinger may be the public face of Intel's current malaise, but the problems started long before his tenure and will likely keep going. Gelsinger was tasked with addressing almost two decades' worth of bad decisions, all of which have compounded.
Design-o-meter: Towards Evaluating and Refining Graphic Designs
Goyal, Sahil, Mahajan, Abhinav, Mishra, Swasti, Udhayanan, Prateksha, Shukla, Tripti, Joseph, K J, Srinivasan, Balaji Vasan
Graphic designs are an effective medium for visual communication. They range from greeting cards to corporate flyers and beyond. Of late, machine learning techniques have become able to generate such designs, which accelerates the rate of content production. An automated way of evaluating their quality therefore becomes critical. Towards this end, we introduce Design-o-meter, a data-driven methodology to quantify the goodness of graphic designs. Further, our approach can suggest modifications to these designs to improve their visual appeal. To the best of our knowledge, Design-o-meter is the first approach that scores and refines designs in a unified framework despite the inherent subjectivity and ambiguity of the setting. Our exhaustive quantitative and qualitative analysis of our approach against baselines adapted for the task (including recent Multimodal LLM-based approaches) brings out the efficacy of our methodology. We hope our work will usher in more interest in this important and pragmatic problem setting.
Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction
Li, Wenhao, Zhou, Jie, Luo, Chuan, Tang, Chao, Zhang, Kun, Zhao, Shixiong
In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices, along with the short online lifespan of scenes, makes effective recommendation in online and dynamic scenes more complex. In this work, we propose Scene-wise Adaptive Network (SwAN), a novel approach that emphasizes high-performance cold-start online recommendations for new scenes. Our approach introduces several crucial capabilities, including scene similarity learning, user-specific scene transition cognition, scene-specific information construction for the new scene, and enhancing the diverged logical information between scenes. We demonstrate SwAN's potential to optimize dynamic multi-scene recommendation problems by effectively handling cold-start recommendations online for any newly arrived scenes. More encouragingly, SwAN has been successfully deployed in Meituan's online catering recommendation service, which serves millions of customers per day, and SwAN has achieved a 5.64% CTR index improvement relative to the baselines and a 5.19% increase in daily order volume proportion.
Tackling Graph Oversquashing by Global and Local Non-Dissipativity
Gravina, Alessio, Eliasof, Moshe, Gallicchio, Claudio, Bacciu, Davide, Schönlieb, Carola-Bibiane
A common problem in Message-Passing Neural Networks is oversquashing -- the limited ability to facilitate effective information flow between distant nodes. Oversquashing is attributed to the exponential decay in information transmission as node distances increase. This paper introduces a novel perspective to address oversquashing, leveraging properties of global and local non-dissipativity, which enable the maintenance of a constant information flow rate. Namely, we present SWAN, a uniquely parameterized GNN model with antisymmetry both in space and weight domains, as a means to obtain non-dissipativity. Our theoretical analysis asserts that by achieving these properties, SWAN offers an enhanced ability to transmit information over extended distances. Empirical evaluations on synthetic and real-world benchmarks that emphasize long-range interactions validate the theoretical understanding of SWAN, and its ability to mitigate oversquashing.
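The link between antisymmetry and non-dissipativity can be illustrated with a general linear-algebra fact, independent of this paper's specific architecture: for any square W, the matrix A = W - W^T satisfies x^T A x = 0, so the linear dynamics dx/dt = A x neither amplify nor damp the squared norm of x. The sketch below checks that identity numerically; the function names and the sample values are made up for the example.

```python
def antisymmetric(w):
    """Build A = W - W^T, which satisfies A^T = -A."""
    n = len(w)
    return [[w[i][j] - w[j][i] for j in range(n)] for i in range(n)]


def quadratic_form(a, x):
    """Compute x^T A x, the rate of change of ||x||^2 / 2 under dx/dt = A x."""
    n = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))


w = [[0.5, 2.0, -1.0], [0.0, 1.5, 3.0], [4.0, -2.0, 0.25]]  # arbitrary weights
a = antisymmetric(w)
x = [1.0, -2.0, 3.0]                                          # arbitrary node state
print(quadratic_form(a, x))  # zero up to rounding: the norm of x is conserved
```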
Beyond Empirical Windowing: An Attention-Based Approach for Trust Prediction in Autonomous Vehicles
Niu, Minxue, Zheng, Zhaobo, Akash, Kumar, Misu, Teruhisa
Humans' internal states play a key role in human-machine interaction, leading to the rise of human state estimation as a prominent field. Compared to swift state changes such as surprise and irritation, modeling gradual states like trust and satisfaction is further challenged by label sparsity: long time-series signals are usually associated with a single label, making it difficult to identify the critical span of state shifts. Windowing has been one widely used technique to enable localized analysis of long time-series data. However, the performance of downstream models can be sensitive to the window size, and determining the optimal window size demands domain expertise and extensive search. To address this challenge, we propose a Selective Windowing Attention Network (SWAN), which employs window prompts and masked attention transformation to enable the selection of attended intervals with flexible lengths. We evaluate SWAN on the task of trust prediction on a new multimodal driving simulation dataset. Experiments show that SWAN significantly outperforms an existing empirical window selection baseline and neural network baselines including CNN-LSTM and Transformer. Furthermore, it shows robustness across a wide span of windowing ranges, compared to the traditional windowing approach.
SWAN: A Generic Framework for Auditing Textual Conversational Systems
We argue that such frameworks should satisfy the following requirements at least. Alertness: They should detect potential problems with extremely high recall (i.e., near-zero misses), while appropriately crediting the benefits of the conversational systems. Moreover, when aiming for high recall, different people involved (i.e., not just users, but also workers who label data for training the system, etc.) should be taken into account; in particular, if the evaluation framework ignores some negative impacts on marginalised people, it does not satisfy the alertness requirement. Specificity: By this we mean that the evaluation framework should be specific when locating the problem(s) within conversations. For example, an evaluation result that says "There is a problem somewhere inside this conversation session" is less useful than one that says "There is a problem in this particular system turn," which in turn is less useful than one that says "There is a problem in this particular claim within this system turn."
Can "The Last of Us" Break the Curse of Bad Video-Game Adaptations?
When the British actor Bob Hoskins agreed to star in "Super Mario Bros.," he had little sense of what he was getting into. The year was 1992, and, although the title on which the film was based had sold tens of millions of copies, a feature-length live-action adaptation of a video game had never been attempted. The movie's eventual tagline, "This ain't no game," reflected a self-conscious distance from its source material: a convoluted parallel-universe plot recast the heroes as Italian American handymen from Brooklyn and the princess they set out to save as an N.Y.U. Hoskins himself hadn't even heard of the Nintendo franchise--but when his kids learned that he would be playing Mario they excitedly showed him the game.