When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Yun, Heecheol, Ki, Kwangmin, Lee, Junghyun, Yang, Eunho
Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
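The consensus-gating idea is concrete enough to sketch. Below is a minimal illustration of selectively ensembling two models based on agreement in their next-token distributions, assuming a shared vocabulary (which sets aside the tokenization-mismatch handling SAFE also performs); the function names and the threshold value are hypothetical, not the paper's implementation.

```python
import torch

def should_ensemble(p1: torch.Tensor, p2: torch.Tensor,
                    agreement_threshold: float = 0.9) -> bool:
    """Gate ensembling on consensus: if both models already concentrate
    probability mass on the same next token, skip the ensemble step.
    The threshold is an illustrative knob, not a value from the paper."""
    same_top = p1.argmax() == p2.argmax()
    confident = min(p1.max().item(), p2.max().item()) >= agreement_threshold
    return not (same_top and confident)

def ensemble_step(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    # Simple average of the two distributions; a shared vocabulary is
    # assumed here, whereas the paper also handles mismatched tokenizers.
    return 0.5 * (p1 + p2)

# Toy usage: two next-token distributions over a 5-token vocabulary.
p1 = torch.tensor([0.92, 0.02, 0.02, 0.02, 0.02])
p2 = torch.tensor([0.05, 0.80, 0.05, 0.05, 0.05])
probs = ensemble_step(p1, p2) if should_ensemble(p1, p2) else p1
next_token = probs.argmax().item()
```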
Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
Li, Yang, Dong, Zhichen, Sun, Yuhan, Wang, Weixun, Xiong, Shaopan, Luo, Yijia, Liu, Jiashun, Lu, Han, Wang, Jiamang, Su, Wenbo, Zheng, Bo, Yan, Junchi
The reasoning patterns of large language models (LLMs) remain opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible: not merely a byproduct of computation, but a mechanistic blueprint of reasoning itself. We first distinguish between locally and globally focused attention heads and reveal that locally focused heads produce a sawtooth pattern near the diagonal, indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these observations with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism: the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by, or coincides with, a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable, structure-aware process, offering a potential step toward more transparent and effective optimization of LLM reasoning.
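Both metrics can be computed directly from a causal attention matrix. The sketch below shows one plausible reading of the two definitions, assuming row-stochastic attention weights; the exact normalization, function names, and window default are our assumptions, not the paper's code.

```python
import torch

def windowed_avg_attention_distance(attn: torch.Tensor, window: int = 32) -> torch.Tensor:
    """attn: (T, T) causal attention matrix (each row sums to 1 over keys j <= i).
    Returns, for each query token i, its attention-weighted backward distance,
    with distances clipped at `window` (our reading of the clipped window)."""
    T = attn.size(0)
    idx = torch.arange(T)
    dist = (idx.view(-1, 1) - idx.view(1, -1)).clamp(min=0).clamp(max=window).float()
    return (attn * dist).sum(dim=-1)

def future_attention_influence(attn: torch.Tensor) -> torch.Tensor:
    """Returns, for each key token j, the average attention it receives
    from subsequent tokens i > j."""
    T = attn.size(0)
    future_mask = torch.tril(torch.ones(T, T), diagonal=-1)  # 1 where i > j
    counts = future_mask.sum(dim=0).clamp(min=1)             # future tokens per j
    return (attn * future_mask).sum(dim=0) / counts
```

High Windowed Average Attention Distance marks tokens doing long-range lookback (candidate preplan tokens); high Future Attention Influence marks tokens that later generation keeps attending to (candidate anchors).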
Ants can be used to make 'tangy' yogurt
An old family recipe from Bulgaria goes under the microscope. Typically, we humans do our best to keep ants out of our kitchens and away from our food. But an old and almost forgotten recipe harnesses the power of these hard-working insects to make yogurt. The recipe, which was once common across Turkey (or Türkiye) and the Balkans, has been recreated in a study published today.
LLM2: Let Large Language Models Harness System 2 Reasoning
Yang, Cheng, Shi, Chufan, Li, Siheng, Shui, Bo, Yang, Yujiu, Lam, Wai
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We posit that these limitations are rooted in the foundational autoregressive architecture of LLMs, which inherently lacks mechanisms for differentiating between desirable and undesirable results. Drawing inspiration from the dual-process theory of human cognition, we introduce LLM2, a novel framework that combines an LLM (System 1) with a process-based verifier (System 2). Within LLM2, the LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs. The verifier is trained with a pairwise comparison loss on synthetic process-supervision data generated through our token quality exploration strategy. Empirical results on mathematical reasoning benchmarks substantiate the efficacy of LLM2, exemplified by an accuracy enhancement from 50.3 to 57.8 (+7.5) for Llama3-1B on GSM8K. Furthermore, when combined with self-consistency, LLM2 achieves additional improvements, boosting major@20 accuracy from 56.2 to 70.2 (+14.0).
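The verifier's training signal is a pairwise comparison loss over (desirable, undesirable) candidates. A common instantiation is the Bradley-Terry form sketched below; whether LLM2 uses exactly this variant is an assumption on our part, since the abstract only names a pairwise loss.

```python
import torch
import torch.nn.functional as F

def pairwise_comparison_loss(score_good: torch.Tensor,
                             score_bad: torch.Tensor) -> torch.Tensor:
    """Push the verifier to score the desirable candidate above the
    undesirable one: loss = -log sigmoid(s_good - s_bad)."""
    return -F.logsigmoid(score_good - score_bad).mean()

# Toy usage: verifier scores for four (good, bad) candidate pairs.
good = torch.tensor([1.2, 0.3, 0.8, 2.0], requires_grad=True)
bad = torch.tensor([0.9, 0.5, -0.1, 1.5])
loss = pairwise_comparison_loss(good, bad)
loss.backward()  # gradients flow back into the verifier's scores
```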
LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Tasks
Zhang, Hongbin, Chen, Kehai, Bai, Xuefeng, Xiang, Yang, Zhang, Min
Large language models (LLMs) have demonstrated impressive multilingual understanding and reasoning capabilities, driven by extensive pre-training on multilingual corpora and fine-tuning on instruction data. However, a performance gap persists between high-resource and low-resource language tasks due to language imbalance in the pre-training corpus, even when more low-resource data is used during fine-tuning. To alleviate this issue, we propose LinguaLIFT, a two-stage instruction tuning framework for advancing low-resource language tasks. An additional language alignment layer is first integrated into the LLM to adapt a pre-trained multilingual encoder, thereby enhancing multilingual alignment through code-switched fine-tuning. The second stage fine-tunes the LLM with English-only instruction data while freezing the language alignment layer, allowing the LLM to transfer task-specific capabilities from English to low-resource language tasks. Additionally, we introduce the Multilingual Math World Problem (MMWP) benchmark, which spans 21 low-resource, 17 medium-resource, and 10 high-resource languages, enabling comprehensive evaluation of multilingual reasoning. Experimental results show that LinguaLIFT outperforms several competitive baselines across MMWP and other widely used benchmarks.
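Operationally, the two-stage recipe amounts to toggling which parameter groups are trainable. A minimal PyTorch sketch of this wiring, with module names invented for illustration (not from the paper's code):

```python
import torch.nn as nn

class LinguaLIFTStyleModel(nn.Module):
    """Hypothetical wiring: a frozen multilingual encoder feeds a trainable
    alignment layer that projects into the LLM's embedding space."""
    def __init__(self, encoder: nn.Module, llm: nn.Module,
                 enc_dim: int, llm_dim: int):
        super().__init__()
        self.encoder = encoder
        self.align = nn.Linear(enc_dim, llm_dim)  # language alignment layer
        self.llm = llm

def set_stage(model: LinguaLIFTStyleModel, stage: int) -> None:
    # Stage 1: train only the alignment layer (code-switched fine-tuning).
    # Stage 2: freeze the alignment layer, fine-tune the LLM on
    # English-only instruction data.
    for p in model.encoder.parameters():
        p.requires_grad = False  # encoder stays frozen throughout
    for p in model.align.parameters():
        p.requires_grad = (stage == 1)
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```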
Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology
Xie, Wei, Ma, Shuoyoucheng, Wang, Zhenhua, Wang, Enze, Chen, Kai, Sun, Xiaobing, Wang, Baosheng
The cognitive mechanism by which Large Language Models (LLMs) solve mathematical problems remains a widely debated and unresolved issue. Currently, there is little interpretable experimental evidence that connects LLMs' problem-solving with human cognitive psychology. To determine whether LLMs possess human-like mathematical reasoning, we modified the problems used in the human Cognitive Reflection Test (CRT). Our results show that, even with the use of Chain-of-Thought (CoT) prompts, mainstream LLMs, including the latest o1 model (noted for its reasoning capabilities), have a high error rate when solving these modified CRT problems. Specifically, the average accuracy dropped by up to 50% compared to the original questions. Further analysis of LLMs' incorrect answers suggests that they primarily rely on pattern matching against their training data, which aligns more with human intuition (System 1 thinking) than with human-like reasoning (System 2 thinking). This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans, and may temper overly optimistic views on LLMs' progress toward artificial general intelligence.
EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models
Chen, Yi, Ge, Yuying, Ge, Yixiao, Ding, Mingyu, Li, Bohao, Wang, Rui, Xu, Ruifeng, Shan, Ying, Liu, Xihui
Multimodal Large Language Models (MLLMs), building upon powerful Large Language Models (LLMs) with exceptional reasoning and generalization capability, have opened up new avenues for embodied task planning. MLLMs excel at integrating diverse environmental inputs, such as real-time task progress, visual observations, and open-form language instructions, which are crucial for executable task planning. In this work, we introduce a benchmark with human annotations, EgoPlan-Bench, to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios. Our benchmark is distinguished by realistic tasks derived from real-world videos, a diverse set of actions involving interactions with hundreds of different objects, and complex visual observations from varied environments. We evaluate various open-source MLLMs, revealing that these models, and even GPT-4V, have not yet evolved into embodied planning generalists. We further construct an instruction-tuning dataset, EgoPlan-IT, from videos of human-object interactions to facilitate the learning of high-level task planning in intricate real-world situations. The experimental results demonstrate that the model tuned on EgoPlan-IT not only significantly improves performance on our benchmark, but also effectively acts as an embodied planner in simulations.
Large Language Models Cannot Self-Correct Reasoning Yet
Huang, Jie, Chen, Xinyun, Mishra, Swaroop, Zheng, Huaixiu Steven, Yu, Adams Wei, Song, Xinying, Zhou, Denny
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance might even degrade post self-correction. Drawing from these insights, we offer suggestions for future research and practical applications in this field.
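The intrinsic setting the paper studies is straightforward to reproduce. A minimal sketch of such a self-correction loop, where `generate` stands in for any text-in/text-out LLM call and the prompt wording is illustrative rather than the paper's exact prompts:

```python
from typing import Callable

def intrinsic_self_correct(generate: Callable[[str], str],
                           question: str, rounds: int = 2) -> str:
    """The model reviews its own answer with no external feedback,
    mirroring the 'intrinsic self-correction' setting critiqued above."""
    answer = generate(question)
    for _ in range(rounds):
        critique_prompt = (
            f"Question: {question}\n"
            f"Your previous answer: {answer}\n"
            "Review your previous answer, identify any problems with it, "
            "and then answer the question again."
        )
        answer = generate(critique_prompt)
    return answer
```

The paper's finding is that loops of exactly this shape often fail to improve, and can degrade, reasoning accuracy.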
Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic
Zhao, Xufeng, Li, Mengdi, Lu, Wenhao, Weber, Cornelius, Lee, Jae Hee, Chu, Kun, Wermter, Stefan
Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their behavior, particularly in terms of reasoning, often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. Generative language models sometimes hallucinate because their reasoning procedures are unconstrained by logical principles. Aiming to improve the zero-shot chain-of-thought reasoning ability of large language models, we propose Logical Chain-of-Thought (LogiCoT), a neurosymbolic framework that leverages principles from symbolic logic to verify and revise reasoning processes. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of the logic-enhanced reasoning paradigm.
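The verify-and-revise loop can be sketched independently of the specific logic checks. In the sketch below, `verify` and `revise` are placeholders for LogiCoT's logic-guided prompts; their signatures and the step-by-step control flow are our assumptions.

```python
from typing import Callable, List

def logicot_style_refine(steps: List[str],
                         verify: Callable[[List[str], str], bool],
                         revise: Callable[[List[str], str], str]) -> List[str]:
    """Check each reasoning step against the steps accepted so far and
    revise any step that fails verification before moving on."""
    checked: List[str] = []
    for step in steps:
        if not verify(checked, step):   # e.g., a logic-based consistency check
            step = revise(checked, step)  # regenerate the offending step
        checked.append(step)
    return checked
```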
Danone pins turnaround hopes on AI
Making the yogurt of the future requires a cast of 21st-century helpers: machine learning, gut science and even a mysterious artificial stomach. At a new Danone facility near Paris, researchers feed dollops of yogurt into globular glass vessels and plastic tubes designed to mimic the human gut. Once the bacteria inside show they can survive the digestive juices, artificial intelligence is put to work to probe their potential health benefits. To consumers bombarded with claims about the supposed power of probiotics, the goal may sound familiar: souped-up yogurt. But the owner of Activia and Actimel is betting technology can yield answers on which friendly bacteria work best and why, giving its products a scientific edge at a time when revenue is lagging and consumers are growing wary of processed food.