
Collaborating Authors: sophia



The Robot and the Philosopher

The New Yorker

In the age of A.I., we endlessly debate what consciousness looks like. Can a camera see things more clearly? Earlier that day, she'd been onstage at the conference I was attending and had been teased for a gesture that looked as though she were flipping off the audience. Now she was in the hotel lobby, in a black gown, holding court. She stepped in front of a bright-orange wall. I had brought an 85-mm. lens. "What are your hopes for the future of humanity?" She wasn't keen to answer, but she responded to the camera.


Gold-Switch: Training-Free Superposition of Slow- and Fast-Thinking LLMs

Lee, Jaeseong, Kwon, Dayoung, Hwang, Seung-won

arXiv.org Artificial Intelligence

Large Reasoning Models (LRMs) excel at structured tasks by emulating deliberate human reasoning, but they often overthink, degrading performance and wasting resources. One possible baseline is to deploy both an LLM and an LRM, then route each input by predicting whether it requires reasoning and might cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with lightweight, training-free regulation that optimizes inference by switching one model on and off. Instead of routing, we selectively unlearn from the LRM at inference time, scaling down computation while preserving reasoning. By analyzing the cumulative energy of the singular values, we identify optimal low-rank projections that adjust reasoning to just the right degree.
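The rank-selection idea the abstract describes can be sketched in a few lines: compute the singular values of a weight matrix, take the cumulative energy (normalized cumulative sum of squared singular values), and keep the smallest rank that captures a target fraction. This is an illustrative NumPy sketch, not the paper's actual procedure; the threshold, function name, and toy matrix are assumptions.

```python
import numpy as np

def rank_from_cumulative_energy(W, threshold=0.9):
    # Singular values of the weight (or weight-difference) matrix
    s = np.linalg.svd(W, compute_uv=False)
    # Fraction of total "energy" (squared spectral mass) captured by each prefix
    energy = np.cumsum(s**2) / np.sum(s**2)
    # Smallest rank whose cumulative energy reaches the threshold
    return int(np.searchsorted(energy, threshold) + 1)

rng = np.random.default_rng(0)
# Toy low-rank-plus-noise matrix: most energy lives in the first 8 directions
W = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64)) + 0.01 * rng.normal(size=(64, 64))
r = rank_from_cumulative_energy(W, threshold=0.9)
```

On a matrix like this, the selected rank stays near the true latent rank (here at most 8), which is the point: the energy profile tells you how much of the projection you can drop.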



Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training

Hu, Yue, Cao, Zanxia, Liu, Yingchao

arXiv.org Machine Learning

First-order optimization methods, such as SGD and Adam, are widely used for training large-scale deep neural networks due to their computational efficiency and robust performance. However, relying solely on gradient information, these methods often struggle to navigate complex loss landscapes with flat regions, plateaus, and saddle points. Second-order methods, which use curvature information from the Hessian matrix, can address these challenges but are computationally infeasible for large models. The Dimer method, a first-order technique that constructs two closely spaced points to probe the local geometry of a potential energy surface, efficiently estimates curvature using only gradient information. Inspired by its use in molecular dynamics simulations for locating saddle points, we propose Dimer-Enhanced Optimization (DEO), a novel framework to escape saddle points in neural network training. DEO adapts the Dimer method to explore a broader region of the loss landscape, approximating the Hessian eigenvector associated with the smallest eigenvalue without computing the full matrix. By periodically projecting the gradient onto the subspace orthogonal to this minimum-curvature direction, DEO guides the optimizer away from saddle points and flat regions, enhancing training efficiency with non-stepwise updates. Preliminary experiments on a Transformer toy model show DEO achieves competitive performance compared to standard first-order methods, improving navigation of complex loss landscapes. Our work repurposes physics-inspired, first-order curvature estimation to enhance neural network training in high-dimensional spaces.
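The core mechanics, estimating a minimum-curvature direction from gradient differences and projecting the gradient orthogonal to it, can be illustrated on a toy saddle. This is a minimal sketch of the dimer idea under assumed step sizes and iteration counts, not DEO's actual algorithm:

```python
import numpy as np

def dimer_min_curvature_direction(grad_fn, x, v, dr=1e-3, iters=20, lr=0.1):
    # Estimate the minimum-curvature direction by rotating a unit vector v.
    # Hv is approximated by a finite difference of gradients (the dimer trick),
    # so only first-order information is ever used.
    v = v / np.linalg.norm(v)
    for _ in range(iters):
        Hv = (grad_fn(x + dr * v) - grad_fn(x - dr * v)) / (2 * dr)
        # Gradient of the Rayleigh quotient v^T H v restricted to the unit sphere
        force = Hv - (v @ Hv) * v
        v = v - lr * force
        v = v / np.linalg.norm(v)
    return v

def projected_gradient(g, v):
    # Remove the gradient component along the minimum-curvature direction
    return g - (g @ v) * v

# Toy saddle: f(x, y) = x^2 - y^2, saddle point at the origin
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])
x = np.array([0.5, 1e-4])
v = dimer_min_curvature_direction(grad, x, v=np.array([1.0, 1.0]))
g_proj = projected_gradient(grad(x), v)
```

For this quadratic, the rotation converges to the eigenvector of the negative-curvature direction (the y-axis), and the projected gradient is orthogonal to it by construction.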


Pre-Training LLMs on a budget: A comparison of three optimizers

Schlotthauer, Joel, Kroos, Christian, Hinze, Chris, Hangya, Viktor, Hahn, Luzian, Küch, Fabian

arXiv.org Artificial Intelligence

Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de facto standard AdamW; the simpler Lion, developed through an evolutionary search; and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. Using the Maximal Update Parametrization and smaller proxy models, we tune the relevant hyperparameters separately for each combination of base architecture and optimizer. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was fastest in terms of training GPU hours, but AdamW led to the best downstream evaluation results.
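For orientation on what distinguishes Lion from AdamW, here is a minimal NumPy sketch of the published Lion update rule (sign of an interpolation between gradient and momentum). It is for illustration only and is not the training setup used in the study; hyperparameter values are the commonly cited defaults, not the study's tuned ones.

```python
import numpy as np

def lion_step(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    # Lion: the update direction is the *sign* of an interpolation between
    # the current gradient and the momentum buffer, so every coordinate
    # moves by exactly lr (plus decoupled weight decay).
    update = np.sign(beta1 * m + (1 - beta1) * g)
    p = p - lr * (update + wd * p)
    # Momentum is refreshed with a separate coefficient *after* the step
    m = beta2 * m + (1 - beta2) * g
    return p, m

p, m = lion_step(np.array([1.0]), g=np.array([0.5]), m=np.zeros(1))
```

The sign operation makes the update magnitude independent of gradient scale, which is one reason Lion needs a smaller learning rate than AdamW.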


Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Xie, Tian, Gao, Zitian, Ren, Qingnan, Luo, Haoming, Hong, Yuqian, Dai, Bryan, Zhou, Joey, Qiu, Kai, Wu, Zhirong, Luo, Chong

arXiv.org Artificial Intelligence

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make several key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it generalizes to the challenging math benchmarks AIME and AMC.
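A "stringent format reward" of this kind can be sketched as a strict pattern check that rewards only responses which reason before answering. The tag names and reward values below are assumptions for illustration, not necessarily the paper's exact prompt format:

```python
import re

def format_reward(output: str) -> float:
    # Reward +1 only when the response is exactly one <think> block followed
    # by one <answer> block; anything else (missing tags, answering without
    # a visible reasoning trace, trailing junk) is penalized.
    pattern = r"\A\s*<think>.+?</think>\s*<answer>.+?</answer>\s*\Z"
    if re.fullmatch(pattern, output, flags=re.DOTALL):
        return 1.0
    return -1.0

good = "<think>Knights always tell the truth, so...</think><answer>A is a knight</answer>"
bad = "<answer>A is a knight</answer>"  # shortcut: no reasoning trace
```

Anchoring the pattern at both ends is what blocks shortcuts: a model cannot earn the reward by emitting the answer tags somewhere inside an otherwise malformed response.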


HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization

Zhao, Huaqin, Li, Jiaxi, Pan, Yi, Liang, Shizhe, Yang, Xiaofeng, Liu, Wei, Li, Xiang, Dou, Fei, Liu, Tianming, Lu, Jin

arXiv.org Artificial Intelligence

Fine-tuning large language models (LLMs) poses significant memory challenges, as the back-propagation process demands extensive resources, especially with growing model sizes. Recent work, MeZO, addresses this issue using a zeroth-order (ZO) optimization method, which reduces memory consumption by matching usage to the inference phase. However, ZO methods tend to converge slowly because their gradient estimates are noisy. To overcome this limitation, we introduce HELENE, a novel, scalable, and memory-efficient optimizer that integrates annealed A-GNB gradients with a diagonal Hessian estimation and layer-wise clipping, serving as a second-order pre-conditioner. This combination allows for faster and more stable convergence. Our theoretical analysis demonstrates that HELENE improves convergence rates, particularly for models with heterogeneous layer dimensions, by reducing the dependency on the total parameter-space dimension. Furthermore, HELENE remains compatible with both full-parameter tuning and parameter-efficient fine-tuning (PEFT), outperforming several state-of-the-art optimizers. The code will be released after review. LLMs have demonstrated remarkable capabilities across various downstream tasks. Fine-tuning these models has become the standard approach for improving task-specific performance, in which first-order optimizers like Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951), Adam (Kingma & Ba, 2014), and AdamW (Loshchilov & Hutter, 2017) are widely used.
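The zeroth-order estimator that MeZO builds on can be sketched with a two-point (SPSA-style) finite difference: perturb all parameters along one shared random direction and scale that direction by the loss difference, so memory cost matches inference. This is a toy NumPy sketch on a quadratic, not MeZO's actual implementation:

```python
import numpy as np

def spsa_grad(loss_fn, theta, eps=1e-3, seed=0):
    # Two-point zeroth-order gradient estimate: only two scalar loss
    # evaluations are needed, and the perturbation z can be regenerated
    # from the seed instead of stored, keeping memory at inference level.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)
    g_scalar = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return g_scalar * z

loss = lambda t: float(np.sum(t**2))  # toy quadratic, true gradient = 2t
theta = np.array([1.0, -2.0, 0.5])
g = spsa_grad(loss, theta)
```

A single such estimate is a rank-one, noisy projection of the true gradient, which is exactly the slow-convergence issue that second-order pre-conditioning aims to mitigate.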


Automate or Assist? The Role of Computational Models in Identifying Gendered Discourse in US Capital Trial Transcripts

Wen-Yi, Andrea W, Adamson, Kathryn, Greenfield, Nathalie, Goldberg, Rachel, Babcock, Sandra, Mimno, David, Koenecke, Allison

arXiv.org Artificial Intelligence

The language used by US courtroom actors in criminal trials has long been studied for biases. However, systematic studies of bias in high-stakes court trials have been difficult, due to the nuanced nature of bias and the legal expertise required. New large language models offer the possibility of automating annotation, saving time and cost. But validating these approaches requires both high quantitative performance and an understanding of how automated methods fit into existing workflows, and what they really offer. In this paper we present a case study of adding an automated system to a complex and high-stakes problem: identifying gender-biased language in US capital trials for women defendants. Our team of experienced death-penalty lawyers and NLP technologists pursued a three-phase study: first annotating manually, then training and evaluating computational models, and finally comparing human annotations to model predictions. Unlike many typical NLP tasks, annotating for gender bias in months-long capital trials was a complicated task involving many individual judgment calls. In contrast to standard arguments for automation based on efficiency and scalability, the legal experts found the computational models most useful in challenging their personal biases in annotation and in providing opportunities to refine and build consensus on annotation rules. This suggests that seeking to replace experts with computational models is both unrealistic and undesirable. Rather, computational models offer valuable opportunities to assist legal experts in annotation-based studies.


Sylvester Stallone's daughters learned how to fight off a coyote, use pepper spray growing up: 'He is crazy'

FOX News

Sylvester Stallone wants his daughters, Sistine, Scarlet and Sophia, to be ready for anything. In new clips from the second season of their Paramount reality series, "The Family Stallone," Stallone spoke about his two eldest daughters, Sophia and Sistine, moving to New York, calling it "traumatic" as he recalled his own experiences with robbery, car accidents, and more. "Since you guys have moved to New York, it's made me very uneasy. You know I'm paranoid anyway because I have a responsibility as a father to do everything I can," he told them early in the episode. The girls then joked about him being "the most paranoid person on the planet," with the youngest daughter Scarlet saying "he is crazy!"