Towards Understanding the Cognitive Habits of Large Reasoning Models

Dong, Jianshuo, Fu, Yujia, Hu, Chuanrui, Zhang, Chao, Qiu, Han

arXiv.org Artificial Intelligence

Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns -- e.g., "Wait, did I miss anything?" -- consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs' cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs' cognitive habit profiles, particularly certain inter-family similarities (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs' CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: https://github.com/jianshuod/CogTest.
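The abstract's "evidence-first" idea (collect verbatim CoT evidence first, label a habit only where evidence exists) can be illustrated with a toy sketch. The cue phrases, the two habit names used here, and the regex matching are all hypothetical simplifications; the actual CogTest taxonomy covers 16 habits and its extraction method is described in the paper, not here.

```python
import re

# Hypothetical cue phrases for two Habits of Mind categories, for
# illustration only; CogTest's real habit taxonomy and extraction
# procedure are more elaborate.
HABIT_CUES = {
    "Striving for Accuracy": [r"did i miss anything", r"let me double-check"],
    "Taking Responsible Risks": [r"let'?s try a different approach"],
}

def extract_habit_evidence(cot: str) -> dict:
    """Evidence-first pass: collect verbatim sentences that match a cue,
    then report a habit only when supporting evidence was found."""
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", cot) if s.strip()]
    found = {}
    for habit, cues in HABIT_CUES.items():
        hits = [s for s in sentences
                if any(re.search(c, s, re.IGNORECASE) for c in cues)]
        if hits:
            found[habit] = hits
    return found

cot = ("Hmm, the sum looks off. Wait, did I miss anything? "
       "Let me double-check the last step.")
print(sorted(extract_habit_evidence(cot)))
```

Keeping the raw matching sentences (rather than just a habit label) is what makes the identification auditable: every reported habit points back to literal CoT text.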


Review for NeurIPS paper: Overfitting Can Be Harmless for Basis Pursuit, But Only to a Degree

Neural Information Processing Systems

Weaknesses: First of all, I should say that I like this paper. The following should be taken more as 'issues in need of clarification' or 'things a reader might be confused about' than as 'weaknesses'. As far as I can tell, Thm 2 and Prop 4 don't resolve this, and I find the experimental evidence difficult to interpret (more on that below). In particular, I'd be curious whether it is possible to get rid of the constant term on the RHS of (9). First, I'd expect the risk curves (of both the l1 and l2 minimisers) to be decreasing in p; isn't that what this is all about? Second, it is claimed in Section 3(i) that the risk of l1-minimisers is unaffected by the norm of beta, but there is a clear difference between the green and the orange curves (BP, with beta norm 1 or 0.1).


Review for NeurIPS paper: Overfitting Can Be Harmless for Basis Pursuit, But Only to a Degree

Neural Information Processing Systems

The paper is a good technical contribution to a phenomenon of widespread interest to the NeurIPS community. While some reviewers had initial concerns about the somewhat strong assumptions (range of validity of the parameters, Gaussianity), the technical hurdles to overcome are significant, and after the discussion these concerns were ameliorated.


Towards Safer Social Media Platforms: Scalable and Performant Few-Shot Harmful Content Moderation Using Large Language Models

Bonagiri, Akash, Li, Lucen, Oak, Rajvardhan, Babar, Zeerak, Wojcieszak, Magdalena, Chhabra, Anshuman

arXiv.org Artificial Intelligence

The prevalence of harmful content on social media platforms poses significant risks to users and society, necessitating more effective and scalable content moderation strategies. Current approaches rely on human moderators, supervised classifiers, and large volumes of training data, and often struggle with scalability, subjectivity, and the dynamic nature of harmful content (e.g., violent content, dangerous challenge trends, etc.). To bridge these gaps, we utilize Large Language Models (LLMs) to undertake few-shot dynamic content moderation via in-context learning. Through extensive experiments on multiple LLMs, we demonstrate that our few-shot approaches can outperform existing proprietary baselines (Perspective and OpenAI Moderation) as well as prior state-of-the-art few-shot learning methods, in identifying harm. We also incorporate visual information (video thumbnails) and assess if different multimodal techniques improve model performance. Our results underscore the significant benefits of employing LLM based methods for scalable and dynamic harmful content moderation online.
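The core mechanism described above, few-shot moderation via in-context learning, amounts to packing labeled examples into the prompt ahead of the post to be classified. A minimal sketch follows; the instruction wording, the HARMFUL/SAFE label set, and the example posts are illustrative assumptions, not the paper's actual prompts or data.

```python
def build_moderation_prompt(examples, post):
    """Assemble a hypothetical few-shot prompt for in-context harm
    classification: instruction, labeled demonstrations, then the query."""
    lines = ["Decide whether each post is HARMFUL or SAFE.", ""]
    for text, label in examples:
        lines.append(f"Post: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Post: {post}")
    lines.append("Label:")  # the LLM completes this line with its verdict
    return "\n".join(lines)

demo = build_moderation_prompt(
    [("Try this dangerous stunt at home!", "HARMFUL"),
     ("Here is my favorite soup recipe.", "SAFE")],
    "Check out this new challenge trend.")
print(demo.endswith("Label:"))
```

Because the demonstrations live in the prompt rather than in model weights, the example set can be swapped as new harm categories emerge, which is the scalability argument the abstract makes.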


Overfitting Can Be Harmless for Basis Pursuit, But Only to a Degree

Neural Information Processing Systems

Recently, there has been significant interest in studying the so-called "double-descent" of the generalization error of linear regression models in the overparameterized and overfitting regime, with the hope that such analysis may provide a first step towards understanding why overparameterized deep neural networks (DNNs) still generalize well. However, to date most of these studies have focused on the min L2-norm solution that overfits the data. In contrast, in this paper we study the overfitting solution that minimizes the L1-norm, which is known as Basis Pursuit (BP) in the compressed sensing literature. For Gaussian features, we show that for a large range of p, up to a limit that grows exponentially with the number of samples n, with high probability the model error of BP is upper bounded by a value that decreases with p. To the best of our knowledge, this is the first analytical result in the literature establishing the double-descent of overfitting BP for finite n and p. Further, our results reveal significant differences between the double-descent of BP and of min L2-norm solutions. Specifically, the double-descent upper bound of BP is independent of the signal strength, and for high SNR and sparse models the descent floor of BP can be much lower and wider than that of min L2-norm solutions.
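The object of study, the min L1-norm interpolating solution, is a linear program and is easy to compute directly. The sketch below solves BP with the standard split b = u - v (u, v >= 0) via scipy's linprog in an overparameterized Gaussian setting; the dimensions and sparsity level are illustrative choices, not the paper's experimental settings.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||b||_1 subject to X b = y, via the LP split b = u - v
    with u, v >= 0, so the objective becomes sum(u) + sum(v)."""
    n, p = X.shape
    c = np.ones(2 * p)                      # ||b||_1 = sum(u) + sum(v)
    A_eq = np.hstack([X, -X])               # X u - X v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p))
    u, v = res.x[:p], res.x[p:]
    return u - v

rng = np.random.default_rng(0)
n, p = 20, 60                               # overparameterized: p > n
beta = np.zeros(p)
beta[:3] = 1.0                              # sparse ground truth
X = rng.standard_normal((n, p))             # i.i.d. Gaussian features
y = X @ beta                                # noiseless observations
b_hat = basis_pursuit(X, y)
print(np.allclose(X @ b_hat, y, atol=1e-6)) # BP interpolates the data
```

Among the infinitely many interpolating solutions when p > n, BP picks the one of minimal L1 norm, whereas the min L2-norm solution studied in prior double-descent work is `np.linalg.pinv(X) @ y`; the abstract's claims concern how the test errors of these two interpolators differ.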


Geometric-Averaged Preference Optimization for Soft Preference Labels

Furuta, Hiroki, Lee, Kuang-Huei, Gu, Shixiang Shane, Matsuo, Yutaka, Faust, Aleksandra, Zen, Heiga, Gur, Izzeddin

arXiv.org Artificial Intelligence

Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. However, preferences can reasonably vary across individuals, and should thus be distributional to reflect the fine-grained relationship between the responses. In this work, we introduce distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihoods in the loss function. In doing so, the scale of the learning loss is adjusted based on the soft labels, and the loss for equally preferred responses is close to zero. This simple modification can be easily applied to any member of the DPO family and helps the models escape the over-optimization and objective mismatch that prior works suffer from. In our experiments, we simulate soft preference labels with AI feedback from LLMs and demonstrate that geometric averaging consistently improves performance on standard benchmarks for alignment research. In particular, we observe more preferable responses than with binary labels and significant improvements on data where modestly-confident labels are in the majority.
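One way a weighted geometric average of likelihoods can enter the DPO loss: in log space, weighting the two response likelihoods by p_hat and 1 - p_hat collapses to scaling the usual DPO margin by (2 * p_hat - 1). The sketch below implements that reading; the exact normalization in the paper may differ (here a tie p_hat = 0.5 yields a constant loss, i.e. a zero gradient, rather than a literally zero loss), and all argument names are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def soft_dpo_loss(logp_w, logp_l, ref_w, ref_l, p_hat, beta=0.1):
    """Hypothetical soft-label DPO loss. logp_* are policy log-likelihoods
    of the preferred (w) and dispreferred (l) responses, ref_* the reference
    model's; p_hat in [0.5, 1] is the soft preference for w. The geometric
    averaging scales the standard DPO margin by (2 * p_hat - 1), so
    p_hat = 1 recovers ordinary DPO and p_hat = 0.5 zeroes the margin."""
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    return -math.log(sigmoid(beta * (2.0 * p_hat - 1.0) * margin))

# A tie gives the constant -log sigmoid(0) = log 2, so such pairs stop
# driving the optimization; a confident label keeps the full margin.
tie = soft_dpo_loss(-1.0, -2.0, -1.5, -1.5, p_hat=0.5)
hard = soft_dpo_loss(-1.0, -2.0, -1.5, -1.5, p_hat=1.0)
print(tie > hard)
```

This down-weighting of near-tied pairs is consistent with the abstract's claim that soft labels temper over-optimization: pairs the annotators barely distinguish no longer push the policy's likelihood ratio without bound.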