step 0
ARoto translation invariance
A.1 Rotations in 2 dimensions In 2-dimensional settings, there exists a single scalar angular position, the yaw angle ฮธ. In order to perform the transformation, we have to express the angular positions in a format suitable for linear transformations; we do so by transforming them to rotation matrices, perform a matrix multiplication, and then transform the angular positions back to angle format. In 2 dimensions, we use eq. After the rotation, we can convert them back to angle format using the 2-argument arc-tangent function: ฮธ = atan2(sinฮธ,cosฮธ) (14) Simplified rotations In 2 dimensions, the computations can be simplified since rotations commute. First, we show that chained rotations result in angle addition/subtraction, that is: Q(ฮธi) Q(ฮธj) = cosฮธi sinฮธi sinฮธicosฮธi cosฮธj sinฮธj sinฮธjcosฮธj (15) = cosฮธicosฮธj sinฮธisinฮธj cosฮธisinฮธj sinฮธicosฮธj sinฮธicosฮธj +cosฮธisinฮธj sinฮธisinฮธj +cosฮธicosฮธj (16) = cos(ฮธi +ฮธj) sin(ฮธi +ฮธj) sin(ฮธi +ฮธj) cos(ฮธi +ฮธj) (17) = Q(ฮธi +ฮธj) (18) Following the same approach, we compute the inverse rotation: Q (ฮธi) Q(ฮธj) = Q( ฮธi) Q(ฮธj) = Q(ฮธj ฮธi) (19) Thus, instead of rotating the angular positions (expressed in rotation matrix form) using the rotation matrix Q, in practice we perform the transformation directly to the angles via addition/subtraction, and replace the matrix Qwith the identity matrix I1 1.
Latent Field Discovery In Interacting Dynamical Systems With Neural Fields
Systems of interacting objects often evolve under the influence of field effects that govern their dynamics, yet previous works have abstracted away from such effects, and assume that systems evolve in a vacuum. In this work, we focus on discovering these fields, and infer them from the observed dynamics alone, without directly observing them.
Supplementary Contents
Theauthors T admits singularvalue( i, i, i)i2I forsomeI, with1= 0 1 .. , i :X!Rand i :Z!R, i.e.T i = i iandT i = i i. Moreo T operatoras: Th= X Figure 5: Estimated rbfkernelwith =.1and1000samples. Thentheestimatorpresented in Equation(4), satisfiesthatw.p.1 : kT(หh h0)k2 O r rs log (pn) n + r log ( 1/ ) n !
Roto-translatedLocalCoordinateFrames ForInteractingDynamicalSystems
First,weintroduce canonicalized roto-translated local coordinate frames for interacting dynamical systems formalized in geometric graphs. Second, by operating solely on these coordinate frames, we enable roto-translation invariant edge prediction and roto-translation equivariant trajectory forecasting. Third, we present anovelmethodology for natural anisotropic continuous filters based onrelativelinear and angular positions ofneighboring objects in the canonicalized local coordinate frames.
Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
Mavi, Vaibhav, Jaroria, Shubh, Sun, Weiqi
Reliability and failure detection of large language models (LLMs) is critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods
This paper investigates the inverse capabilities and broader utility of multimodal latent spaces within task-specific AI (Artificial Intelligence) models. While these models excel at their designed forward tasks (e.g., text-to-image generation, audio-to-text transcription), their potential for inverse mappings remains largely unexplored. We propose an optimization-based framework to infer input characteristics from desired outputs, applying it bidirectionally across Text-Image (BLIP, Flux.1-dev) and Text-Audio (Whisper-Large-V3, Chatterbox-TTS) modalities. Our central hypothesis posits that while optimization can guide models towards inverse tasks, their multimodal latent spaces will not consistently support semantically meaningful and perceptually coherent inverse mappings. Experimental results consistently validate this hypothesis. We demonstrate that while optimization can force models to produce outputs that align textually with targets (e.g., a text-to-image model generating an image that an image captioning model describes correctly, or an ASR model transcribing optimized audio accurately), the perceptual quality of these inversions is chaotic and incoherent. Furthermore, when attempting to infer the original semantic input from generative models, the reconstructed latent space embeddings frequently lack semantic interpretability, aligning with nonsensical vocabulary tokens. These findings highlight a critical limitation. multimodal latent spaces, primarily optimized for specific forward tasks, do not inherently possess the structure required for robust and interpretable inverse mappings. Our work underscores the need for further research into developing truly semantically rich and invertible multimodal latent spaces.
LLM-Flock: Decentralized Multi-Robot Flocking via Large Language Models and Influence-Based Consensus
Large Language Models (LLMs) have advanced rapidly in recent years, demonstrating strong capabilities in problem comprehension and reasoning. Inspired by these developments, researchers have begun exploring the use of LLMs as decentralized decision-makers for multi-robot formation control. However, prior studies reveal that directly applying LLMs to such tasks often leads to unstable and inconsistent behaviors, where robots may collapse to the centroid of their positions or diverge entirely due to hallucinated reasoning, logical inconsistencies, and limited coordination awareness. To overcome these limitations, we propose a novel framework that integrates LLMs with an influence-based plan consensus protocol. In this framework, each robot independently generates a local plan toward the desired formation using its own LLM. The robots then iteratively refine their plans through a decentralized consensus protocol that accounts for their influence on neighboring robots. This process drives the system toward a coherent and stable flocking formation in a fully decentralized manner. We evaluate our approach through comprehensive simulations involving both state-of-the-art closed-source LLMs (e.g., o3-mini, Claude 3.5) and open-source models (e.g., Llama3.1-405b, Qwen-Max, DeepSeek-R1). The results show notable improvements in stability, convergence, and adaptability over previous LLM-based methods. We further validate our framework on a physical team of Crazyflie drones, demonstrating its practical viability and effectiveness in real-world multi-robot systems.