AITopics | Problem Solving

Visual question answering (VQA) is a challenging task that requires an in-depth understanding of vision and language, as well as multi-modal reasoning.

machine learning, natural language, question answering, (18 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Freiburg (0.04)
Asia > Middle East > Israel (0.04)

Industry: Automobiles & Trucks (0.97)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
(2 more...)

Game Solving with Online Fine-Tuning

Neural Information Processing Systems, Oct-9-2025, 05:33:19 GMT

Game solving is a similar, yet more difficult task than mastering a game.

artificial intelligence, machine learning, solver, (13 more...)

Country:

North America > Canada > Alberta (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Taiwan (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games > Computer Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Games (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.69)

9a39b4925e35cf447ccba8757137d84f-Paper-Conference.pdf

Neural Information Processing Systems, Oct-9-2025, 02:28:01 GMT

evolutionary algorithm, large language model, machine learning, (19 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(8 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)
(3 more...)

93b8618a9061f8a55825c13ecf28392b-Paper-Conference.pdf

Neural Information Processing Systems, Oct-9-2025, 01:42:46 GMT

artificial intelligence, heat map, machine learning, (17 more...)

Country: North America > United States > New York > Tompkins County > Ithaca (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.67)

ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings

Shibo Hao

Neural Information Processing Systems, Oct-9-2025, 01:22:05 GMT

ToolkenGPT offers the flexibility to plug in an arbitrary number of tools by expanding the set of toolkens on the fly. In addition, it improves tool use by allowing extensive demonstration data for learning the toolken embeddings.
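The idea of expanding the toolken set on the fly can be illustrated with a minimal sketch. It assumes, beyond what the summary states, that each toolken is one extra embedding row scored alongside the frozen word rows; the class `ToolkenHead`, its methods, and the toy vectors are hypothetical and are not the authors' implementation.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class ToolkenHead:
    """Hypothetical output head: frozen word rows plus expandable toolken rows."""

    def __init__(self, word_rows: Dict[str, Vector]) -> None:
        # Frozen language-model rows: copied once and never modified.
        self.word_rows = {w: list(v) for w, v in word_rows.items()}
        # Toolken rows: the only part that would be learned from demonstrations.
        self.toolken_rows: Dict[str, Vector] = {}

    def add_tool(self, name: str, embedding: Vector) -> None:
        """Plug in a new tool by adding one row; word rows stay untouched."""
        self.toolken_rows[name] = list(embedding)

    def scores(self, hidden: Vector) -> Dict[str, float]:
        """Score every word and every toolken against the same hidden state."""
        out = {w: dot(v, hidden) for w, v in self.word_rows.items()}
        out.update({f"<{t}>": dot(v, hidden) for t, v in self.toolken_rows.items()})
        return out

    def predict(self, hidden: Vector) -> str:
        """Return the highest-scoring entry, which may be a word or a toolken."""
        s = self.scores(hidden)
        return max(s, key=s.get)


head = ToolkenHead({"the": [1.0, 0.0], "cat": [0.0, 1.0]})
before = head.predict([0.2, 0.9])   # only words are available
head.add_tool("add", [1.0, 1.0])    # expand the toolken set at run time
after = head.predict([0.6, 0.6])    # the new toolken can now be selected
```

In this sketch a predicted entry wrapped in angle brackets would signal that the model should switch to calling that tool; how arguments are produced is outside its scope.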

large language model, machine learning, natural language, (19 more...)

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > China > Beijing > Beijing (0.04)
(3 more...)

Genre:

Workflow (0.67)
Overview (0.46)
Research Report (0.46)

Industry:

Leisure & Entertainment (0.93)
Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

Liu, Jiahang, Qi, Yunpeng, Zhang, Jiazhao, Li, Minghan, Wang, Shaoan, Wu, Kui, Ye, Hanjing, Zhang, Hong, Chen, Zhibo, Zhong, Fangwei, Zhang, Zhizheng, Wang, He

arXiv.org Artificial Intelligence, Oct-9-2025

Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules, a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
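The two modules described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's method: the bin counts, the distance cap, the token layout, and the use of a scalar confidence as the gate are all hypothetical choices that the abstract does not specify.

```python
import math
from typing import List

# Hypothetical discretisation; the abstract does not give these values.
ANGLE_BINS = 8
DIST_BINS = 4
MAX_DIST = 8.0  # assumed distance cap, in metres


def polar_token(dx: float, dy: float) -> int:
    """Encode a relative target offset as a single polar-coordinate token id."""
    r = min(math.hypot(dx, dy), MAX_DIST)
    theta = math.atan2(dy, dx) % (2 * math.pi)  # wrap the angle into [0, 2*pi)
    a = min(int(theta / (2 * math.pi) * ANGLE_BINS), ANGLE_BINS - 1)
    d = min(int(r / MAX_DIST * DIST_BINS), DIST_BINS - 1)
    return d * ANGLE_BINS + a  # one compact id per (distance, angle) cell


def gated_update(memory: List[float], observed: List[float],
                 confidence: float) -> List[float]:
    """Blend a new observation into memory; a low gate preserves the old memory."""
    g = max(0.0, min(1.0, confidence))  # clamp the gate to [0, 1]
    return [(1.0 - g) * m + g * o for m, o in zip(memory, observed)]


token_ahead = polar_token(1.0, 0.0)    # target straight ahead, nearby
token_side = polar_token(0.0, 3.0)     # target to the side, farther away
occluded = gated_update([1.0, 0.0], [0.0, 1.0], confidence=0.0)
visible = gated_update([1.0, 0.0], [0.0, 1.0], confidence=1.0)
```

With the gate at zero the memory is returned unchanged, which mirrors the stated goal of keeping the target representation stable through occlusions; with the gate at one the memory is fully overwritten by the new observation.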

arxiv preprint arxiv, large language model, natural language, (13 more...)

arXiv:2510.07134

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.69)

WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Qian, Zezhong, Chi, Xiaowei, Li, Yuming, Wang, Shizun, Qin, Zhiyuan, Ju, Xiaozhu, Han, Sirui, Zhang, Shanghang

arXiv.org Artificial Intelligence, Oct-9-2025

Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with precisely the geometric and cross-view priors that make it possible to address such extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our designed video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap. The generated wrist observations effectively expand training data to novel views and lead to significant performance improvements for downstream VLA models across various tasks. Wrist-view observations play a central role in vision-language-action (VLA) models because they directly capture the fine-grained hand-object interactions that underlie precise manipulation.

artificial intelligence, arxiv preprint arxiv, machine learning, (16 more...)

arXiv:2510.07313

Country: Asia (0.46)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.82)
