WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

Cherian, Anoop, Doyle, River, Ben-Dov, Eyal, Lohit, Suhas, Peng, Kuan-Chuan

arXiv.org Artificial Intelligence

Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, which generate solutions, and Reflectors, which verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
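The classic Dawid-Skene aggregation that WISE modifies can be sketched in a few lines of EM. The version below is the generic, unweighted algorithm for illustration only; the paper's two-stage variant with Reflector-assigned weights is not reproduced here, and all names are our own.

```python
# Generic Dawid-Skene EM for aggregating multiple agents' answers.
# This is an illustrative sketch, NOT the paper's weighted two-stage variant.
from collections import defaultdict

def dawid_skene(labels, n_iter=20):
    """labels: {(item, agent): answer}; returns {item: aggregated answer}."""
    items = sorted({i for i, _ in labels})
    agents = sorted({a for _, a in labels})
    classes = sorted(set(labels.values()))

    # Initialize each item's class posterior with a majority vote.
    post = {}
    for i in items:
        votes = defaultdict(float)
        for a in agents:
            if (i, a) in labels:
                votes[labels[(i, a)]] += 1.0
        total = sum(votes.values())
        post[i] = {c: votes[c] / total for c in classes}

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per agent.
        prior = {c: sum(post[i][c] for i in items) / len(items) for c in classes}
        conf = {}
        for a in agents:
            conf[a] = {}
            for c in classes:
                denom = sum(post[i][c] for i in items if (i, a) in labels)
                conf[a][c] = {
                    k: (sum(post[i][c] for i in items if labels.get((i, a)) == k)
                        + 1e-9) / (denom + 1e-9 * len(classes))
                    for k in classes
                }
        # E-step: re-estimate each item's posterior over the true answer.
        for i in items:
            scores = {}
            for c in classes:
                p = prior[c]
                for a in agents:
                    if (i, a) in labels:
                        p *= conf[a][c][labels[(i, a)]]
                scores[c] = p
            z = sum(scores.values()) or 1.0
            post[i] = {c: scores[c] / z for c in classes}

    return {i: max(post[i], key=post[i].get) for i in items}
```

Unlike plain majority voting, the confusion matrices let a consistently reliable agent outvote several unreliable ones; WISE's modification additionally folds in the Reflectors' feedback weights.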


A Hierarchical Hybrid AI Approach: Integrating Deep Reinforcement Learning and Scripted Agents in Combat Simulations

Black, Scotty, Darken, Christian

arXiv.org Artificial Intelligence

In the domain of combat simulations in support of wargaming, the development of intelligent agents has predominantly been characterized by rule-based, scripted methodologies with deep reinforcement learning (RL) approaches only recently being introduced. While scripted agents offer predictability and consistency in controlled environments, they fall short in dynamic, complex scenarios due to their inherent inflexibility. Conversely, RL agents excel in adaptability and learning, offering potential improvements in handling unforeseen situations, but suffer from significant challenges such as black-box decision-making processes and scalability issues in larger simulation environments. This paper introduces a novel hierarchical hybrid artificial intelligence (AI) approach that synergizes the reliability and predictability of scripted agents with the dynamic, adaptive learning capabilities of RL. By structuring the AI system hierarchically, the proposed approach aims to utilize scripted agents for routine, tactical-level decisions and RL agents for higher-level, strategic decision-making, thus addressing the limitations of each method while leveraging their individual strengths. This integration is shown to significantly improve overall performance, providing a robust, adaptable, and effective solution for developing and training intelligent agents in complex simulation environments.


Infinite folds

MIT Technology Review

But her passion is for paper--with no scissors. Today, she's a tessellation expert who teaches, invents new designs, and writes papers on the underlying math.

Madonna Yoder '17, photographed in her studio. Photo: Ross Mantle

When Madonna Yoder '17 was eight years old, she learned how to fold a square piece of paper over and over and over again. After about 16 folds, she held a bird in her hands. The first time she pulled the tail of a flapping crane, she says, she realized. That first piece was an origami classic, folded by kids at summer camp for generations and many people's first foray into the art form.


Fold your own tessellation

MIT Technology Review

Yoder recommends printing the pattern on paper whose weight falls between normal printer paper and cardstock: it should fold in straight lines (not too thick), fold back and forth easily along the same line (not too thin), and be crisp enough to make a satisfying snapping noise when you shake it. Her favorite paper is Skytone, which is commonly used to print certificates and fancy envelopes. Once you have your crease pattern on a sheet of paper, cut out the hexagon that contains the pattern. Yoder recommends using a straightedge and blade on a cutting mat instead of scissors, whether that means an X-Acto knife and a ruler on a sheet of cardboard or a quilting ruler and rotary cutter on a fabric cutting mat. The next step is folding the background grid of black lines that the pattern uses as references. Assuming you've cut out your hexagon precisely, you can use the edge of the hexagon and the printed lines to make your creases, or you can fold as if there were no lines printed by folding the hexagon in half (edge to opposite edge) and then folding those edges in to the center to make quarter lines, first in one direction and then in the other two.


Thinker: Learning to Think Fast and Slow

Chung, Stephen, Du, Wenyu, Fu, Jie

arXiv.org Artificial Intelligence

Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.
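The four-stage loop described above can be sketched as plain control flow around a generic LLM call. Everything here (function names, prompt wording, the budget value) is an illustrative assumption; the paper trains these stages with RL rather than merely prompting them.

```python
# A hedged sketch of the Fast Thinking / Verification / Slow Thinking /
# Summarization loop. `generate` stands in for any LLM call; the prompt
# templates are illustrative, not the paper's.

def four_stage_answer(question, generate, fast_budget=1000):
    # 1) Fast Thinking: answer within a strict token budget.
    fast = generate(f"Answer concisely:\n{question}", max_tokens=fast_budget)
    # 2) Verification: the model judges its own initial response.
    verdict = generate(f"Question: {question}\nDraft answer: {fast}\n"
                       "Is this draft correct? Reply 'yes' or 'no'.")
    if verdict.strip().lower().startswith("yes"):
        return fast  # trust the fast answer and save tokens
    # 3) Slow Thinking: refine the initial response with more deliberation.
    slow = generate(f"Question: {question}\nThe draft '{fast}' may be wrong. "
                    "Reason step by step and give a corrected answer.")
    # 4) Summarization: distill the refinement into precise steps.
    return generate(f"Summarize this reasoning into a short, precise answer:\n{slow}")
```

The early return after Verification is what yields the inference-efficiency gains the abstract reports for the Fast Thinking mode alone.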


I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Galichin, Andrey, Dontsov, Alexey, Druzhinina, Polina, Razzhigaev, Anton, Rogov, Oleg Y., Tutubalina, Elena, Oseledets, Ivan

arXiv.org Artificial Intelligence

Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We observe that reasoning LLMs consistently use vocabulary associated with human reasoning processes. We hypothesize that these words correspond to specific reasoning moments within the models' internal mechanisms. To test this hypothesis, we employ Sparse Autoencoders (SAEs), a technique for sparse decomposition of neural network activations into human-interpretable features. We introduce ReasonScore, an automatic metric to identify active SAE features during these reasoning moments. We perform manual and automatic interpretation of the features detected by our metric, and find those with activation patterns matching uncertainty, exploratory thinking, and reflection. Through steering experiments, we demonstrate that amplifying these features increases performance on reasoning-intensive benchmarks (+2.2%) while producing longer reasoning traces (+20.5%). Using the model diffing technique, we provide evidence that these features are present only in models with reasoning capabilities. Our work provides the first step towards a mechanistic understanding of reasoning in LLMs. Code is available at https://github.com/AIRI-Institute/SAE-Reasoning
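As a toy illustration of the idea behind ReasonScore, one can compare a feature's activation on reasoning-marker tokens against its activation elsewhere. The marker list and the simple mean-difference score below are our own assumptions for illustration, not the paper's definition.

```python
# Illustrative ReasonScore-style metric: does an SAE feature fire more on
# reasoning-marker tokens than on the rest of the text? The marker set and
# the mean-difference scoring are assumptions, not the paper's metric.

REASONING_MARKERS = {"wait", "hmm", "therefore", "alternatively", "check"}

def reason_score(tokens, activations):
    """tokens: list of str; activations: one SAE feature's per-token value."""
    on = [a for t, a in zip(tokens, activations) if t.lower() in REASONING_MARKERS]
    off = [a for t, a in zip(tokens, activations) if t.lower() not in REASONING_MARKERS]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    # A high score means the feature is selectively active at reasoning moments.
    return mean(on) - mean(off)
```

Features ranking highly under such a score are the ones the authors then inspect and amplify in their steering experiments.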


Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models

Liu, Yule, Zheng, Jingyi, Sun, Zhen, Peng, Zifan, Dong, Wenhan, Sha, Zeyang, Cui, Shiwen, Wang, Weiqiang, He, Xinlei

arXiv.org Artificial Intelligence

Recent advancements in large reasoning models (LRMs) have demonstrated the effectiveness of scaling test-time computation to enhance reasoning capabilities on various tasks. However, LRMs often suffer from an ``overthinking'' problem, where the model generates excessively redundant reasoning steps with limited performance gains. In this work, we empirically reveal an important characteristic of LRM behavior: placing external CoTs generated by smaller models between the thinking tokens (\texttt{<think>} and \texttt{</think>}) can effectively manipulate the model to generate fewer thoughts. Building on this finding, we propose a simple yet efficient pipeline, \Method, to enable LRMs to bypass unnecessary intermediate steps, thereby significantly reducing computational costs. We conduct extensive experiments to evaluate the utility and efficiency of \Method. For instance, when applied to QwQ-32B on the LiveBench/Code dataset, \Method preserves the original performance while reducing output token counts by approximately 30\%, with minimal overhead introduced by the CoT generator. Furthermore, we identify two suboptimal modes, blindly following flawed external thoughts and unnecessary rethinking, and show that simple mitigations, such as difficulty-aware fallbacks, can further improve performance. Overall, \Method offers a practical, general, and efficient way to optimize LRM inference, making powerful reasoning models more accessible and scalable for real-world applications.
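The core trick, placing an externally generated chain of thought between the model's thinking delimiters, amounts to simple prompt construction. The template below is an illustrative assumption; the exact tags and chat formatting depend on the target model.

```python
# Minimal sketch of the prompt-construction idea: wrap a small model's CoT
# in the large reasoning model's thinking delimiters so it treats the
# deliberation as already done and answers directly. Template and default
# tag names are illustrative assumptions.

def build_prompt(question, external_cot,
                 open_tag="<think>", close_tag="</think>"):
    # The LRM sees the external CoT as if it were its own completed thought.
    return f"{question}\n{open_tag}\n{external_cot}\n{close_tag}\n"
```

A difficulty-aware fallback, as the abstract suggests, would skip this wrapping (letting the LRM think from scratch) when the external CoT looks unreliable.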


Job-killing robot learns at work, and it's coming to the factory floor

FOX News

Industries can rethink how work gets done, raising the bar for productivity and workplace safety. Across industries, companies are feeling the squeeze from labor shortages, rising costs and nonstop pressure to boost efficiency. Robots are quickly becoming real-life solutions, and their promise has never felt more relevant. With factories and warehouses scrambling to fill essential roles, the search for fresh ideas is heating up. That's where AEON comes in.


Validating remotely sensed biomass estimates with forest inventory data in the western US

Cao, Xiuyu, Sexton, Joseph O., Wang, Panshi, Gounaridis, Dimitrios, Carter, Neil H., Zhu, Kai

arXiv.org Artificial Intelligence

Monitoring aboveground biomass (AGB) and its density (AGBD) at high resolution is essential for carbon accounting and ecosystem management. While NASA's spaceborne Global Ecosystem Dynamics Investigation (GEDI) LiDAR mission provides globally distributed reference measurements for AGBD estimation, the majority of commercial remote sensing products based on GEDI remain without rigorous or independent validation. Here, we present an independent regional validation of an AGBD dataset offered by terraPulse, Inc., based on independent reference data from the US Forest Service Forest Inventory and Analysis (FIA) program. Aggregated to 64,000-hectare hexagons and US counties across the US states of Utah, Nevada, and Washington, we found very strong agreement between terraPulse and FIA estimates. At the hexagon scale, we report R2 = 0.88, RMSE = 26.68 Mg/ha, and a correlation coefficient (r) of 0.94. At the county scale, agreement improves to R2 = 0.90, RMSE = 32.62 Mg/ha, slope = 1.07, and r = 0.95. Spatial and statistical analyses indicated that terraPulse AGBD values tended to exceed FIA estimates in non-forest areas, likely due to FIA's limited sampling of non-forest vegetation. The terraPulse AGBD estimates also exhibited lower values in high-biomass forests, likely due to saturation effects in its optical remote-sensing covariates. This study advances operational carbon monitoring by delivering a scalable framework for comprehensive AGBD validation using independent FIA data, as well as a benchmark validation of a new commercial dataset for global biomass monitoring.
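The agreement statistics quoted above (r, R2, slope, RMSE) can be recomputed from paired hexagon- or county-level values with a few lines of plain Python. This is a generic sketch of the standard formulas, not the study's analysis code.

```python
# Agreement statistics between paired reference and estimated values,
# e.g. FIA vs. terraPulse AGBD per hexagon. Generic formulas, not the
# study's code; R2 here is the squared Pearson correlation of the fit.
import math

def agreement(reference, estimate):
    n = len(reference)
    mx, my = sum(reference) / n, sum(estimate) / n
    sxx = sum((a - mx) ** 2 for a in reference)
    syy = sum((b - my) ** 2 for b in estimate)
    sxy = sum((a - mx) * (b - my) for a, b in zip(reference, estimate))
    r = sxy / math.sqrt(sxx * syy)
    return {
        "r": r,
        "R2": r * r,              # coefficient of determination of the fit
        "slope": sxy / sxx,       # OLS slope of estimate on reference
        "RMSE": math.sqrt(sum((b - a) ** 2
                              for a, b in zip(reference, estimate)) / n),
    }
```

A slope above 1 with high r, as at the county scale, indicates the estimates track the reference closely but run systematically high, consistent with the non-forest overestimation the authors describe.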


Why do so many AI company logos look like buttholes?

New Scientist

Feedback is New Scientist's popular sideways look at the latest science and technology news. You can submit items you believe may amuse readers to Feedback by emailing feedback@newscientist.com. The past few years have seen the emergence of a great many AI companies. This is extremely exciting/alarming (delete according to whether you bought shares early), but it has also had a secondary consequence. Along with the proliferation of AI companies has come a proliferation of AI company logos.