quadrilateral
Variational Reasoning for Language Models
Zhou, Xiangxin, Liu, Zichen, Wang, Haonan, Du, Chao, Lin, Min, Li, Chongxuan, Wang, Liang, Pang, Tianyu
We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
A bio-inspired sand-rolling robot: effect of body shape on sand rolling performance
Liao, Xingjue, Liu, Wenhao, Wu, Hao, Qian, Feifei
The capability of effectively moving on complex terrains such as sand and gravel can empower our robots to robustly operate in outdoor environments, and assist with critical tasks such as environment monitoring, search-and-rescue, and supply delivery. Inspired by the Mount Lyell salamander's ability to curl its body into a loop and effectively roll down {\Revision hill slopes}, in this study we develop a sand-rolling robot and investigate how its locomotion performance is governed by the shape of its body. We experimentally tested three different body shapes: Hexagon, Quadrilateral, and Triangle. We found that Hexagon and Triangle can achieve a faster rolling speed on sand, but exhibited more frequent failures of getting stuck. Analysis of the interaction between robot and sand revealed the failure mechanism: the deformation of the sand produced a local ``sand incline'' underneath robot contact segments, increasing the effective region of supporting polygon (ERSP) and preventing the robot from shifting its center of mass (CoM) outside the ERSP to produce sustainable rolling. Based on this mechanism, a highly-simplified model successfully captured the critical body pitch for each rolling shape to produce sustained rolling on sand, and informed design adaptations that mitigated the locomotion failures and improved robot speed by more than 200$\%$. Our results provide insights into how locomotors can utilize different morphological features to achieve robust rolling motion across deformable substrates.
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Jia, Yiming, Li, Jiachen, Yue, Xiang, Li, Bo, Nie, Ping, Zou, Kai, Chen, Wenhu
Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines, including mathematics, physics, finance, and chemistry, etc. Starting with a meticulously selected set of 30,000 seed images, we employ Google Image Search to identify websites containing similar images. We collect and process HTML data from over 700K unique URLs. Through a pipeline of content extraction, filtering, and synthesis, we construct a dataset of approximately 900K question-answer (QA) pairs, with 40% consisting of visual QA pairs and the remaining comprising text-based QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance improvements: (1) fine-tuning on Llava-OV results in 10-20 absolute points improvement across benchmarks, and (2) fine-tuning from MAmmoTH-VL yields a 5 absolute points gain across benchmarks. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B parameter class on MMMU-Pro (40.7), MathVerse (42.6), and DynaMath (55.7). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
Kamoi, Ryo, Zhang, Yusen, Das, Sarkar Snigdha Sarathi, Zhang, Ranran Haoran, Zhang, Rui
Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). While further analysis is essential, there is a deficiency in datasets for evaluating the visual perception of LVLMs. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions in 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential for enhancing the visual perception of LVLMs, but observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
Symmetry Preservation in Swarms of Oblivious Robots with Limited Visibility
Gerlach, Raphael, von der Gracht, Sören, Hahn, Christopher, Harbig, Jonas, Kling, Peter
In the general pattern formation (GPF) problem, a swarm of simple autonomous, disoriented robots must form a given pattern. The robots' simplicity imply a strong limitation: When the initial configuration is rotationally symmetric, only patterns with a similar symmetry can be formed [Yamashita, Suzyuki; TCS 2010]. The only known algorithm to form large patterns with limited visibility and without memory requires the robots to start in a near-gathering (a swarm of constant diameter) [Hahn et al.; SAND 2024]. However, not only do we not know any near-gathering algorithm guaranteed to preserve symmetry but most natural gathering strategies trivially increase symmetries [Castenow et al.; OPODIS 2022]. Thus, we study near-gathering without changing the swarm's rotational symmetry for disoriented, oblivious robots with limited visibility (the OBLOT-model, see [Flocchini et al.; 2019]). We introduce a technique based on the theory of dynamical systems to analyze how a given algorithm affects symmetry and provide sufficient conditions for symmetry preservation. Until now, it was unknown whether the considered OBLOT-model allows for any non-trivial algorithm that always preserves symmetry. Our first result shows that a variant of Go-to-the-Average always preserves symmetry but may sometimes lead to multiple, unconnected near-gathering clusters. Our second result is a symmetry-preserving near-gathering algorithm that works on swarms with a convex boundary (the outer boundary of the unit disc graph) and without holes (circles of diameter 1 inside the boundary without any robots).
Human-Like Geometric Abstraction in Large Pre-trained Neural Networks
Campbell, Declan, Kumar, Sreejan, Giallanza, Tyler, Griffiths, Thomas L., Cohen, Jonathan D.
Specifically, we apply that can capture regularities in the external world. By neural network models to behavioral tasks from recent empirical forming abstractions that can generalize to future experience, work (Sablé-Meyer et al., 2021, 2022; Hsu, Wu, & humans are able to exhibit efficient learning and strong generalization Goodman, 2022) that catalogue three effects indicative of abstraction across domains (Lake, Salakhutdinov, & Tenenbaum, in human geometric reasoning. First, humans are 2015; Hull, 1920). One domain in which this has sensitive to geometric complexity, such that they are slower been observed by cognitive scientists is geometric reasoning to recall complex images as compared to simpler ones (Sablé- (Dehaene, Al Roumi, Lakretz, Planton, & Sablé-Meyer, Meyer et al., 2022). Second, humans are sensitive to geometric 2022), where people consistently extract abstract concepts, regularity (based on features such as right angles, parallel such as parallelism, symmetry, and convexity, that generalize sides, and symmetry) such that they are able to classify regular across many visual instances.
Accelerating hyperbolic t-SNE
Skrodzki, Martin, van Geffen, Hunter, Chaves-de-Plaza, Nicolas F., Höllt, Thomas, Eisemann, Elmar, Hildebrandt, Klaus
The need to understand the structure of hierarchical or high-dimensional data is present in a variety of fields. Hyperbolic spaces have proven to be an important tool for embedding computations and analysis tasks as their non-linear nature lends itself well to tree or graph data. Subsequently, they have also been used in the visualization of high-dimensional data, where they exhibit increased embedding performance. However, none of the existing dimensionality reduction methods for embedding into hyperbolic spaces scale well with the size of the input data. That is because the embeddings are computed via iterative optimization schemes and the computation cost of every iteration is quadratic in the size of the input. Furthermore, due to the non-linear nature of hyperbolic spaces, Euclidean acceleration structures cannot directly be translated to the hyperbolic setting. This paper introduces the first acceleration structure for hyperbolic embeddings, building upon a polar quadtree. We compare our approach with existing methods and demonstrate that it computes embeddings of similar quality in significantly less time. Implementation and scripts for the experiments can be found at https://graphics.tudelft.nl/accelerating-hyperbolic-tsne.
Theorem Discovery Amongst Cyclic Polygons
In [8], a characterization is made of linear systems involving angle bisection conditions which are not full rank. In such a system, one of the conditions is implied by the remainder, and, if the angle bisections are interpreted geometrically, this dependence may be stated in a number of different ways as a geometry theorem. The characterization leads to a catalog of such linear systems. An approach to theorem discovery is proposed wherein a linear system is initially selected, and then interpreted geometrically as a theorem. In [7], a program is described which applies this approach, constructing a particular geometry theorem corresponding to a randomly selected linear system from the catalog. In order to reduce diagram complexity, the program is biased in favor of constructing cyclic polygons wherever possible. In the case where it is able to construct a cyclic polygon using all the rows of the linear system, the theorem which is produced has the following form. Given a cyclic 2n-gon, where n 1 specified pairs of sides are parallel, then a final specified pair of sides is also parallel. For example, in a cyclic hexagon, with two pairs of opposite sides parallel, the third pair of sides is also parallel.
Minimax Analysis for Inverse Risk in Nonparametric Planer Invertible Regression
Okuno, Akifumi, Imaizumi, Masaaki
We study a minimax risk of estimating inverse functions on a plane, while keeping an estimator is also invertible. Learning invertibility from data and exploiting an invertible estimator are used in many domains, such as statistics, econometrics, and machine learning. Although the consistency and universality of invertible estimators have been well investigated, analysis on the efficiency of these methods is still under development. In this study, we study a minimax risk for estimating invertible bi-Lipschitz functions on a square in a $2$-dimensional plane. We first introduce an inverse $L^2$-risk to evaluate an estimator which preserves invertibility. Then, we derive lower and upper rates for a minimax inverse risk by exploiting a representation of invertible functions using level-sets. To obtain an upper bound, we develop an estimator asymptotically almost everywhere invertible, whose risk attains the derived minimax lower rate up to logarithmic factors. The derived minimax rate corresponds to that of the non-invertible bi-Lipschitz function, which rejects the expectation of whether invertibility improves the minimax rate, similar to other shape constraints.
Spatially and Seamlessly Hierarchical Reinforcement Learning for State Space and Policy space in Autonomous Driving
Despite advances in hierarchical reinforcement learning, its applications to path planning in autonomous driving on highways are challenging. One reason is that conventional hierarchical reinforcement learning approaches are not amenable to autonomous driving due to its riskiness: the agent must move avoiding multiple obstacles such as other agents that are highly unpredictable, thus safe regions are small, scattered, and changeable over time. To overcome this challenge, we propose a spatially hierarchical reinforcement learning method for state space and policy space. The high-level policy selects not only behavioral sub-policy but also regions to pay mind to in state space and for outline in policy space. Subsequently, the low-level policy elaborates the short-term goal position of the agent within the outline of the region selected by the high-level command. The network structure and optimization suggested in our method are as concise as those of single-level methods. Experiments on the environment with various shapes of roads showed that our method finds the nearly optimal policies from early episodes, outperforming a baseline hierarchical reinforcement learning method, especially in narrow and complex roads. The resulting trajectories on the roads were similar to those of human strategies on the behavioral planning level.