cube
Contrastive Representations for Temporal Reasoning
In classical AI, perception relies on learning state-based representations, while planning -- temporal reasoning over action sequences -- is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Contrastive Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik's Cube. In particular, for the Rubik's Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS -- though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.
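The standard temporal contrastive objective that this abstract critiques can be sketched as a batch InfoNCE loss. This is a minimal illustration of that baseline only, not CRTR's negative sampling scheme, which the abstract does not specify. The function name `info_nce_loss`, the temperature value, and the synthetic embeddings are assumptions made for the example.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Standard InfoNCE over a batch: row i of `positives` is the positive
    for row i of `anchors`; every other row in the batch acts as a negative."""
    # L2-normalize so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the true future state) as the target.
    return -np.mean(np.diag(log_probs))

# Synthetic stand-ins for encoder outputs (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
z_t = rng.normal(size=(8, 16))                        # embeddings of states at time t
z_future = z_t + 0.01 * rng.normal(size=(8, 16))      # embeddings of later states, same trajectory
z_random = rng.normal(size=(8, 16))                   # unrelated embeddings

loss_aligned = info_nce_loss(z_t, z_future)
loss_random = info_nce_loss(z_t, z_random)
```

In this setup negatives come from other trajectories in the batch, so an encoder can lower the loss by picking up trajectory-specific features rather than temporal structure, which is the failure mode the abstract describes.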
Latent Chain-of-Thought for Visual Reasoning
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
724711fccb09d4519cbbb6d245d3675d-Paper-Conference.pdf
In this paper, we quantify the DP's information-gathering progress by estimating the prediction uncertainty of privileged observations reconstructed from partial observations, and accordingly propose the framework of Uncertainty-Sensitive Privileged Learning (USPL).
SOLIDGEO: Measuring Multimodal Spatial Math Reasoning in Solid Geometry
Geometry is a fundamental branch of mathematics and plays a crucial role in evaluating the reasoning capabilities of multimodal large language models (MLLMs). However, existing multimodal mathematics benchmarks mainly focus on plane geometry and largely ignore solid geometry, which requires spatial reasoning and is more challenging than plane geometry. To address this critical gap, we introduce SOLIDGEO, the first large-scale benchmark specifically designed to evaluate the performance of MLLMs on mathematical reasoning tasks in solid geometry.
GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling
The emergence of large language models (LLMs) enables the development of intelligent agents capable of engaging in complex and multi-turn dialogues. However, multi-agent collaboration faces critical safety challenges, such as hallucination amplification and error injection and propagation. This paper presents GUARDIAN, a unified method for detecting and mitigating multiple safety concerns in GUARDing Intelligent Agent collaboratioNs. By modeling the multi-agent collaboration process as a discrete-time temporal attributed graph, GUARDIAN explicitly captures the propagation dynamics of hallucinations and errors. The unsupervised encoder-decoder architecture incorporating an incremental training paradigm learns to reconstruct node attributes and graph structures from latent embeddings, enabling the identification of anomalous nodes and edges with unparalleled precision. Moreover, we introduce a graph abstraction mechanism based on the Information Bottleneck Theory, which compresses temporal interaction graphs while preserving essential patterns. Extensive experiments demonstrate GUARDIAN's effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization.
There's a new skydiving Rubik's Cube-solving champ in town, but there's one big problem with this feat
Few things amaze me like people who can solve a Rubik's Cube.
Sure, lots of things amaze me more -- mountains, elaborate water features, how my dog sits on the couch and watches like he's super into it -- but it's a very specific kind of amazement that's like, Man, that's wild; I could never do that... nor do I really care to. But I like that other people are super into it to the point that there's now a Guinness World Record cottage industry of people solving them under different circumstances, and we've got a new top dog when it comes to solving a bunch of them while skydiving. A Rubik's Cube, the ultimate test of dexterity and spinning colored blocks.
A Cubing Strategy for Identifying Stable Hyperparameter Regions for Uncertainty Quantification in Spatial Deep Learning
Isaac Amouzou, Ben Seiyon Lee
Spatially referenced datasets have become increasingly prevalent across many fields, largely driven by advances in data collection methods such as satellite remote sensing. In many applications, predictions at unobserved locations are accompanied by reliable uncertainty estimates. While deep learning methods provide both scalable and accurate models for spatial predictions, there remains no clear consensus for addressing uncertainty quantification in spatial deep learning. Monte Carlo (MC) dropout has become a popular approach for uncertainty quantification, yet existing implementations typically focus on tuning the dropout rate while fixing other influential hyperparameters, such as weight decay and the predictive standard deviation multiplier, often through ad-hoc or manual tuning. We propose a cubing-based diagnostic framework that recursively partitions the hyperparameter space to identify stable regions where MC dropout yields well-calibrated predictive intervals. The approach evaluates hyperparameter regions using scoring rules relative to a statistical baseline model, which serves as a calibration anchor. Through a simulation study spanning multiple spatial dependence regimes as well as a large remotely-sensed land surface temperature dataset, we demonstrate that our approach produces competitive or superior predictive intervals compared to the baseline model. Our methodology provides practitioners with a systematic procedure for incorporating uncertainty quantification into spatial deep learning models.
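The idea of recursively partitioning a hyperparameter space can be sketched as follows. This is a hypothetical illustration, not the authors' procedure: the helper names (`split_box`, `cube_search`), the greedy keep-the-best-child rule, the search ranges, and the toy score are all assumptions. The abstract instead scores regions with scoring rules relative to a statistical baseline model.

```python
from itertools import product

def split_box(box):
    """Split an axis-aligned box, given as a list of (low, high) pairs,
    into 2^d equally sized child boxes by halving every dimension."""
    halves = []
    for low, high in box:
        mid = (low + high) / 2.0
        halves.append([(low, mid), (mid, high)])
    return [list(child) for child in product(*halves)]

def center(box):
    """Return the midpoint of the box as a tuple of hyperparameter values."""
    return tuple((low + high) / 2.0 for low, high in box)

def cube_search(box, score, depth):
    """Repeatedly split the box and keep the child whose center scores lowest."""
    for _ in range(depth):
        box = min(split_box(box), key=lambda b: score(center(b)))
    return box

# Toy stand-in for a calibration score (lower is better). The optimum at
# (0.2, 0.01, 1.5) is hypothetical and chosen only for illustration.
def toy_score(h):
    dropout, weight_decay, sd_mult = h
    return (dropout - 0.2) ** 2 + (weight_decay - 0.01) ** 2 + (sd_mult - 1.5) ** 2

# Assumed ranges for dropout rate, weight decay, and the predictive
# standard deviation multiplier.
start = [(0.0, 0.5), (0.0, 0.1), (1.0, 2.0)]
region = cube_search(start, toy_score, depth=4)
```

In practice each call to the score function would train and evaluate an MC dropout model, so the number of evaluations (here 8 per level) is the main cost to budget for.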
Appendix Table of Contents
Incorporating causality into reinforcement learning methods increases the interpretability of artificial intelligence, which helps humans understand the underlying mechanism of algorithms and check the source of failures. However, the learned causal transition model may contain human-readable private information about the environment, which could raise privacy issues. To mitigate this potential negative societal impact, the causal transition model needs to be encrypted and only accessible to algorithms and trustworthy users.
In this section, besides the most related formulation, robust RL introduced in Sec 3.3, we also introduce some other related RL problem formulations partially shown in Figure 3. Then, we limit our discussion to mainly two lines of work that are related to ours: (1) promoting robustness in RL; (2) concerning the spurious correlation issues in RL.
B.1 Related RL formulations
Robustness to noisy state: POMDPs and SA-MDPs.
Beyond Aesthetics: Cultural Competence in Text-to-Image Models
Text-to-Image (T2I) models are being increasingly adopted in diverse global communities where they create visual representations of their unique cultures. Current T2I benchmarks primarily focus on faithfulness, aesthetics, and realism of generated images, overlooking the critical dimension of cultural competence. In this work, we introduce a framework to evaluate cultural competence of T2I models along two crucial dimensions: cultural awareness and cultural diversity, and present a scalable approach using a combination of structured knowledge bases and large language models to build a large dataset of cultural artifacts to enable this evaluation. In particular, we apply this approach to build CUBE (CUltural BEnchmark for Text-to-Image models), a first-of-its-kind benchmark to evaluate cultural competence of T2I models. CUBE covers cultural artifacts associated with 8 countries across different geo-cultural regions and along 3 concepts: cuisine, landmarks, and art. CUBE consists of 1) CUBE-1K, a set of high-quality prompts that enable the evaluation of cultural awareness, and 2) CUBE-CSpace, a larger dataset of cultural artifacts that serves as grounding to evaluate cultural diversity. We also introduce cultural diversity as a novel T2I evaluation component, leveraging quality-weighted Vendi score. Our evaluations reveal significant gaps in the cultural awareness of existing models across countries and provide valuable insights into the cultural diversity of T2I outputs for underspecified prompts. Our methodology is extendable to other cultural regions and concepts and can facilitate the development of T2I models that better cater to the global population.
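The diversity measure mentioned above builds on the Vendi score, which is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. The sketch below shows the plain Vendi score only. The quality weighting used in the abstract is not shown, and the function name and toy similarity matrices are assumptions for illustration.

```python
import numpy as np

def vendi_score(K):
    """Plain Vendi score of an n x n positive semidefinite similarity
    matrix K with ones on the diagonal."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)   # eigenvalues sum to 1
    eigvals = eigvals[eigvals > 1e-12]    # drop numerical zeros before taking logs
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))

# Two extreme toy cases (hypothetical similarity matrices):
identical = np.ones((4, 4))   # four copies of the same item
distinct = np.eye(4)          # four mutually dissimilar items

score_identical = vendi_score(identical)
score_distinct = vendi_score(distinct)
```

The score can be read as an effective number of distinct items: it is close to 1 when all samples are the same and close to n when all n samples are mutually dissimilar.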