scissors
1cc70be9fb6a83bc46cf4ac21a91e0b0-Supplemental-Conference.pdf
In this section, we provide the class assignment of all datasets under different missing rates. The proposed setting is anew multi-task learning scenario. Its practical applications could not be limited by the mentioned assumption in the testing space. Table B.2: The observed classes of each task onOffice-Caltech with different missing rates. Office-Home [9] contains images from four domains/tasks: Artistic, Clipart, Product and Realworld. Skin-Lesion contains three skin lesion classification tasks: HAM10000 [8], Dermofit [2] and Derm7pt[5].
Wish List 2025: A WIRED Gift Guide
These thoughtful and beautifully designed gifts will delight everyone in your circle. Craighill, the maker of desk accessories, key rings, and other fine lifestyle items, has kept things simple and functional with its Chroma scissors. The stainless steel slicers are sturdy and easy to use, but also gorgeous, with primary accent colors that help the scissors stand out in a crowded pen cup. Craighill has been racing to keep these in stock, but they're proving incredibly popular, so don't wait. A hooded parka is a must for anyone who lives where the ground freezes. This is one of Patagonia's toastiest offerings, with a waterproof shell stuffed with 700-fill-power down, a fully insulated cinchable hood, and storm flaps on the inside and outside of the zipper.
- Asia > Nepal (0.14)
- North America > United States > New York (0.05)
- South America > French Guiana > Guyane > Cayenne (0.04)
- (5 more...)
- Leisure & Entertainment > Games > Computer Games (0.94)
- Media > Music (0.68)
- Health & Medicine (0.68)
Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction
Yin, Congchi, Wu, Tianyi, Shu, Yankai, Gu, Alex, Wang, Yunhan, Shao, Jun, Jiang, Xun, Li, Piji
Existing tasks fall short in evaluating reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for humans discovery of real world. We introduce a novel evaluation paradigm, \textit{black-box interaction}, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it in given exploration turns, and reasoning over observed input-output pairs. Leveraging this idea, we build the \textsc{Oracle} benchmark which comprises 6 types of black-box task and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70\% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40\%. Further analysis indicates a universal difficulty among LLMs: They lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.
- Transportation > Air (1.00)
- Leisure & Entertainment > Games (1.00)
Understanding Human Limits in Pattern Recognition: A Computational Model of Sequential Reasoning in Rock, Paper, Scissors
Cross, Logan, Brockbank, Erik, Gerstenberg, Tobias, Fan, Judith E., Yamins, Daniel L. K., Haber, Nick
How do we predict others from patterns in their behavior and what are the computational constraints that limit this ability? We investigate these questions by modeling human behavior over repeated games of rock, paper, scissors from Brockbank & Vul (2024). Against algorithmic opponents that varied in strategic sophistication, people readily exploit simple transition patterns (e.g., consistently playing rock after paper) but struggle to detect more complex sequential dependencies. To understand the cognitive mechanisms underlying these abilities and their limitations, we deploy Hypothetical Minds (HM), a large language model-based agent that generates and tests hypotheses about opponent strategies, as a cognitive model of this behavior (Cross et al., 2024). We show that when applied to the same experimental conditions, HM closely mirrors human performance patterns, succeeding and failing in similar ways. To better understand the source of HM's failures and whether people might face similar cognitive bottlenecks in this context, we performed a series of ablations and augmentations targeting different components of the system. When provided with natural language descriptions of the opponents' strategies, HM successfully exploited 6/7 bot opponents with win rates >80% suggesting that accurate hypothesis generation is the primary cognitive bottleneck in this task. Further, by systematically manipulating the model's hypotheses through pedagogically-inspired interventions, we find that the model substantially updates its causal understanding of opponent behavior, revealing how model-based analyses can produce testable hypotheses about human cognition.
Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
Yu, Albert, Li, Chengshu, Macesanu, Luca, Balaji, Arnav, Ray, Ruchira, Mooney, Raymond, Martín-Martín, Roberto
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated availability to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. Our extensive evaluations in simulation and real-world -- on a physical robot with 18 unique human participants over 27 hours -- demonstrate the ability of our method to effectively collaborate with diverse human users, yielding significantly improved task success and user experience than a pure LLM baseline and other agent allocation models. See additional videos and materials at https://robin-lab.cs.utexas.edu/MicoBot/.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.93)
- Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.91)
Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping
Kim, Sang Min, Heo, Hyeongjun, Kim, Junho, Lee, Yonghyeon, Kim, Young Min
We propose Point2Act, which directly retrieves the 3D action point relevant for a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models opened the possibility for generalist robots that can perform a zero-shot task following natural language descriptions within an unseen environment. While the semantics obtained from large-scale image and language datasets provide contextual understanding in 2D images, the rich yet nuanced features deduce blurry 2D regions and struggle to find precise 3D locations for actions. Our proposed 3D relevancy fields bypass the high-dimensional features and instead efficiently imbue lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments due to geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized, reasoning fine-grained 3D spatial context that can directly transfer to an explicit position for physical action at the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in under 20 seconds, facilitating practical manipulation tasks. Project page: https://sangminkim-99.github.io/point2act/
- Africa > Togo (0.05)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation
Gupta, Ashray, Joseph, Rohan, Rai, Sunny
While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi. The test set is publicly available for research purposes here https://github.com/Inequilazitive/
- Workflow (0.72)
- Research Report (0.64)
Beyond Nash Equilibrium: Bounded Rationality of LLMs and humans in Strategic Decision-making
Zheng, Kehan, Zhou, Jinfeng, Wang, Hongning
Large language models are increasingly used in strategic decision-making settings, yet evidence shows that, like humans, they often deviate from full rationality. In this study, we compare LLMs and humans using experimental paradigms directly adapted from behavioral game-theory research. We focus on two well-studied strategic games, Rock-Paper-Scissors and the Prisoner's Dilemma, which are well known for revealing systematic departures from rational play in human subjects. By placing LLMs in identical experimental conditions, we evaluate whether their behaviors exhibit the bounded rationality characteristic of humans. Our findings show that LLMs reproduce familiar human heuristics, such as outcome-based strategy switching and increased cooperation when future interaction is possible, but they apply these rules more rigidly and demonstrate weaker sensitivity to the dynamic changes in the game environment. Model-level analyses reveal distinctive architectural signatures in strategic behavior, and even reasoning models sometimes struggle to find effective strategies in adaptive situations. These results indicate that current LLMs capture only a partial form of human-like bounded rationality and highlight the need for training methods that encourage flexible opponent modeling and stronger context awareness.
Robustness-Aware Tool Selection and Manipulation Planning with Learned Energy-Informed Guidance
Dong, Yifei, Zhang, Yan, Calinon, Sylvain, Pokorny, Florian T.
Humans subconsciously choose robust ways of selecting and using tools, based on years of embodied experience -- for example, choosing a ladle instead of a flat spatula to serve meatballs. However, robustness under uncertainty remains underexplored in robotic tool-use planning. This paper presents a robustness-aware framework that jointly selects tools and plans contact-rich manipulation trajectories, explicitly optimizing for robustness against environmental disturbances. At the core of our approach is a learned, energy-based robustness metric, which guides the planner towards robust manipulation behaviors. We formulate a hierarchical optimization pipeline that first identifies a tool and configuration that optimizes robustness, and then plans a corresponding manipulation trajectory that maintains robustness throughout execution. We evaluate our approach across three representative tool-use tasks. Simulation and real-world results demonstrate that our approach consistently selects robust tools and generates disturbance-resilient manipulation plans.