
Collaborating Authors: calvin


CALVIN: Improved Contextual Video Captioning via Instruction Tuning

Neural Information Processing Systems

The recent emergence of powerful Vision-Language Models (VLMs) has significantly improved image captioning. Some of these models have been extended to caption videos as well. However, their ability to understand complex scenes is limited, and the descriptions they provide tend to be overly verbose and focused on the superficial appearance of objects. Scene descriptions, especially in movies, require a deeper contextual understanding than general-purpose video captioning. To address this challenge, we propose CALVIN, a specialized video LLM that leverages previous movie context to generate fully contextual scene descriptions. To achieve this, we train our model on a suite of tasks that integrate both image-based question answering and video captioning within a unified framework, before applying instruction tuning to refine the model's ability to provide scene captions. Lastly, we observe that our model responds well to prompt engineering and few-shot in-context learning techniques, enabling the user to adapt it to any new movie with very little additional annotation.
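As a rough illustration of the few-shot adaptation workflow the abstract mentions, the sketch below assembles a prompt from prior movie context, a couple of annotated exemplar scenes, and the new scene to be captioned. The prompt layout, the Exemplar structure, and the idea of feeding this string to the model alongside scene frames are illustrative assumptions, not the interface described in the paper.

```python
# Minimal sketch of few-shot, context-conditioned prompting for scene captioning.
# All names and the prompt format are hypothetical, for illustration only.

from dataclasses import dataclass


@dataclass
class Exemplar:
    scene_summary: str   # short description of a previously annotated scene
    caption: str         # the reference contextual caption for that scene


def build_prompt(movie_context: str, exemplars: list[Exemplar], current_scene: str) -> str:
    """Assemble prior movie context, a few annotated exemplars, and the new scene."""
    parts = [f"Movie context so far:\n{movie_context}\n"]
    for i, ex in enumerate(exemplars, 1):
        parts.append(f"Example {i}:\nScene: {ex.scene_summary}\nCaption: {ex.caption}\n")
    parts.append(f"Scene: {current_scene}\nCaption:")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        movie_context="Two estranged siblings reunite at their father's funeral.",
        exemplars=[Exemplar("They argue in the parking lot.",
                            "Tension from the funeral boils over as the siblings trade blame.")],
        current_scene="The older sibling waits alone in the empty chapel.",
    )
    print(prompt)  # this string would accompany the scene frames given to the video LLM
```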


The New Brutality of OpenAI

The Atlantic - Technology

The company is pursuing aggressive legal tactics against its opponents. On September 12, Jay Edelson received what he expected to be a standard legal document. Edelson is a lawyer representing the parents of Adam Raine; they are suing OpenAI, alleging that their 16-year-old son took his life at the encouragement of ChatGPT. OpenAI's lawyers had some inquiries for the opposing counsel, which is normal. For instance, they requested information about therapy Raine may have received, and Edelson complied.



WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Qian, Zezhong, Chi, Xiaowei, Li, Yuming, Wang, Shizun, Qin, Zhiyuan, Ju, Xiaozhu, Han, Sirui, Zhang, Shanghang

arXiv.org Artificial Intelligence

Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with precisely the geometric and cross-view priors needed to address such extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap. The generated wrist observations effectively expand training data to novel views and lead to significant performance improvements for downstream VLA models across various tasks. Wrist-view observations play a central role in vision-language-action (VLA) models because they directly capture the fine-grained hand-object interactions that underlie precise manipulation.
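A structural sketch of the two-stage idea described above, under placeholder assumptions: stage (i) estimates a wrist-view pose and 4D point cloud from anchor views, and stage (ii) conditions a video generator on that reconstruction. All function names, tensor shapes, and outputs are hypothetical stand-ins, not the paper's implementation of VGGT or the SPC loss.

```python
# Two-stage pipeline sketch: reconstruction from anchor views, then wrist-view generation.

import numpy as np


def reconstruct_wrist_view(anchor_frames: list[np.ndarray]) -> dict:
    """Stage 1 placeholder: a geometry model would return a wrist camera pose and
    a 4D point cloud; here we return dummy outputs with plausible shapes."""
    return {
        "wrist_pose": np.eye(4),                                 # 4x4 camera-to-world transform
        "point_cloud": np.zeros((len(anchor_frames), 4096, 3)),  # per-frame 3D points
    }


def generate_wrist_video(recon: dict, num_frames: int = 16) -> np.ndarray:
    """Stage 2 placeholder: a video model would synthesize frames from the
    reconstructed perspective; we return blank frames of the right shape."""
    return np.zeros((num_frames, 224, 224, 3), dtype=np.uint8)


if __name__ == "__main__":
    anchors = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]
    recon = reconstruct_wrist_view(anchors)
    wrist_video = generate_wrist_video(recon)
    print(wrist_video.shape)  # (16, 224, 224, 3): synthetic wrist-view frames
```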



ManiTrend: Bridging Future Generation and Action Prediction with 3D Flow for Robotic Manipulation

He, Yuxin, Nie, Qiang

arXiv.org Artificial Intelligence

Language-conditioned manipulation is a vital but challenging robotic task due to the high-level abstraction of language. To address this, researchers have sought improved goal representations derived from natural language. In this paper, we highlight 3D flow, which represents the motion trend of 3D particles within a scene, as an effective bridge between language-based future image generation and fine-grained action prediction. To this end, we develop ManiTrend, a unified framework that models the dynamics of 3D particles, visual observations, and manipulation actions with a causal transformer. Within this framework, features for 3D flow prediction serve as additional conditions for future image generation and action prediction, alleviating the complexity of pixel-wise spatiotemporal modeling and providing seamless action guidance. Furthermore, 3D flow can substitute for missing or heterogeneous action labels during large-scale pretraining on cross-embodiment demonstrations. Experiments on two comprehensive benchmarks demonstrate that our method achieves state-of-the-art performance with high efficiency. Our code and model checkpoints will be available upon acceptance.
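A minimal PyTorch sketch of the conditioning idea, assuming a simplified single-step setup: features from a 3D-flow head are concatenated with observation features before action prediction. The layer sizes, the MLP heads, and the omission of the causal transformer and image-generation branch are simplifications for illustration, not the paper's architecture.

```python
# Flow-feature conditioning sketch: the action head takes observation + flow features.

import torch
import torch.nn as nn


class FlowConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=512, flow_dim=128, action_dim=7):
        super().__init__()
        self.flow_head = nn.Linear(obs_dim, flow_dim)        # predicts 3D-flow features
        self.action_head = nn.Sequential(                    # action prediction conditioned
            nn.Linear(obs_dim + flow_dim, 256), nn.ReLU(),   # on the flow features
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_feat: torch.Tensor):
        flow_feat = self.flow_head(obs_feat)
        action = self.action_head(torch.cat([obs_feat, flow_feat], dim=-1))
        return flow_feat, action


if __name__ == "__main__":
    policy = FlowConditionedPolicy()
    obs = torch.randn(4, 512)        # batch of pre-encoded observation features
    flow, act = policy(obs)
    print(flow.shape, act.shape)     # torch.Size([4, 128]) torch.Size([4, 7])
```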


EnvBridge: Bridging Diverse Environments with Cross-Environment Knowledge Transfer for Embodied AI

Kagaya, Tomoyuki, Lou, Yuxuan, Yuan, Thong Jing, Lakshmi, Subramanian, Karlekar, Jayashree, Pranata, Sugiri, Murakami, Natsuki, Kinose, Akira, Oguri, Koki, Wick, Felix, You, Yang

arXiv.org Artificial Intelligence

In recent years, Large Language Models (LLMs) have demonstrated strong reasoning capabilities, drawing attention to their applications as agents in various decision-making processes. One notably promising application of LLM agents is robotic manipulation. Recent research has shown that LLMs can generate textual plans or control code for robots, providing substantial flexibility and interaction capabilities. However, these methods still face challenges in terms of flexibility and applicability across different environments, limiting their ability to adapt autonomously. Current approaches typically fall into two categories: those relying on environment-specific policy training, which restricts their transferability, and those generating code actions from fixed prompts, which leads to diminished performance when confronted with new environments. These limitations significantly constrain the generalizability of agents in robotic manipulation. To address them, we propose a novel method called EnvBridge, which retains successful robot control code from source environments and transfers it to target environments. EnvBridge enhances the agent's adaptability and performance across diverse settings by leveraging insights from multiple environments. Our experiments demonstrate that LLM agents can successfully leverage diverse knowledge sources to solve complex tasks. Consequently, our approach significantly enhances the adaptability and robustness of robotic manipulation agents in planning across diverse environments. The development of Large Language Models (LLMs) has remarkably advanced various fields, demonstrating impressive capabilities in understanding and generating human-like text.
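A toy sketch of the retention-and-transfer idea: successful control code from source environments is stored and later retrieved by task similarity, then placed in the prompt for a new environment. The word-overlap retrieval, memory format, and prompt template are stand-ins for illustration, not EnvBridge's actual mechanism.

```python
# Knowledge-transfer sketch: store successful code, retrieve by task similarity,
# and include it as reference material in the prompt for a new environment.

def store_success(memory: list[dict], env: str, task: str, code: str) -> None:
    memory.append({"env": env, "task": task, "code": code})


def retrieve(memory: list[dict], task: str, k: int = 1) -> list[dict]:
    """Rank stored snippets by naive word overlap with the new task description."""
    query = set(task.lower().split())
    scored = sorted(memory, key=lambda m: -len(query & set(m["task"].lower().split())))
    return scored[:k]


def build_prompt(task: str, retrieved: list[dict]) -> str:
    examples = "\n\n".join(f"# from {r['env']}: {r['task']}\n{r['code']}" for r in retrieved)
    return f"Reference code from other environments:\n{examples}\n\nWrite code for: {task}"


if __name__ == "__main__":
    memory: list[dict] = []
    store_success(memory, "tabletop-sim", "pick up the red block",
                  "robot.pick(obj='red_block')")
    print(build_prompt("pick up the blue block",
                       retrieve(memory, "pick up the blue block")))
```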


A Taxonomy of Ambiguity Types for NLP

Li, Margaret Y., Liu, Alisa, Wu, Zhaofeng, Smith, Noah A.

arXiv.org Artificial Intelligence

Ambiguity is a critical component of language that allows for more effective communication between speakers, but it is often ignored in NLP. Recent work suggests that NLP systems may struggle to grasp certain elements of human language understanding because they may not handle ambiguity at the level that humans naturally do in communication. Additionally, different types of ambiguity may serve different purposes and require different approaches for resolution, and we aim to investigate how language models' abilities vary across types. We propose a taxonomy of ambiguity types as seen in English to facilitate NLP analysis. Our taxonomy can help make meaningful splits in language ambiguity data, allowing for more fine-grained assessments of both datasets and model performance.
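As a small illustration of how a type-level split enables finer-grained assessment, the sketch below reports model accuracy per ambiguity type instead of in aggregate. The type names and record fields are generic placeholders, not the taxonomy proposed in the paper.

```python
# Per-type evaluation sketch: group predictions by ambiguity-type label.

from collections import defaultdict


def accuracy_by_type(examples: list[dict]) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["ambiguity_type"]] += 1
        correct[ex["ambiguity_type"]] += int(ex["prediction"] == ex["label"])
    return {t: correct[t] / total[t] for t in total}


if __name__ == "__main__":
    data = [
        {"ambiguity_type": "lexical", "label": "bank=riverbank", "prediction": "bank=riverbank"},
        {"ambiguity_type": "lexical", "label": "bat=animal", "prediction": "bat=club"},
        {"ambiguity_type": "syntactic", "label": "PP attaches to verb", "prediction": "PP attaches to verb"},
    ]
    print(accuracy_by_type(data))  # e.g. {'lexical': 0.5, 'syntactic': 1.0}
```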


3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

Ke, Tsung-Wei, Gkanatsios, Nikolaos, Fragkiadaki, Katerina

arXiv.org Artificial Intelligence

We marry diffusion policies and 3D scene representations for robot manipulation. Diffusion policies learn the action distribution conditioned on the robot and environment state using conditional diffusion models. They have recently been shown to outperform both deterministic and alternative state-conditioned action distribution learning methods. 3D robot policies use 3D scene feature representations aggregated from one or multiple camera views using sensed depth. They have been shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy architecture that, given a language instruction, builds a 3D representation of the visual scene and conditions on it to iteratively denoise 3D rotations and translations for the robot's end-effector. At each denoising iteration, our model represents end-effector pose estimates as 3D scene tokens and predicts the 3D translation and rotation error for each of them, by featurizing them using 3D relative attention to other 3D visual and language tokens. 3D Diffuser Actor sets a new state of the art on RLBench with an absolute performance gain of 16.3% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it outperforms the current SOTA in the setting of zero-shot unseen-scene generalization, successfully completing 0.2 more tasks on average, a 7% relative increase. It also works in the real world from a handful of demonstrations. We ablate our model's architectural design choices, such as 3D scene featurization and 3D relative attention, and show that they all help generalization. Our results suggest that 3D scene representations and powerful generative modeling are keys to efficient robot learning from demonstrations.
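A highly simplified PyTorch sketch of the iterative denoising loop: a noisy end-effector pose estimate is repeatedly updated by a network that predicts the remaining translation and rotation error, conditioned on pooled scene and language features. The MLP denoiser, 9-dimensional pose parameterization, and fixed update rule are assumptions for illustration; the model in the paper uses 3D scene tokens and 3D relative attention rather than this placeholder.

```python
# Iterative pose-denoising sketch: refine a noisy pose by subtracting predicted error.

import torch
import torch.nn as nn


class PoseDenoiser(nn.Module):
    def __init__(self, ctx_dim=256, pose_dim=9):  # 3 translation + 6D rotation (assumption)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + pose_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim)
        )

    def forward(self, pose: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([pose, context], dim=-1))  # predicted pose error


def denoise(model: PoseDenoiser, context: torch.Tensor, steps: int = 10) -> torch.Tensor:
    pose = torch.randn(context.shape[0], 9)     # start from random noise
    for _ in range(steps):
        pose = pose - model(pose, context)      # subtract the predicted error each step
    return pose


if __name__ == "__main__":
    model = PoseDenoiser()
    scene_and_language = torch.randn(2, 256)         # pooled 3D scene + instruction features
    print(denoise(model, scene_and_language).shape)  # torch.Size([2, 9])
```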


Vision-Language Foundation Models as Effective Robot Imitators

Li, Xinghang, Liu, Minghuan, Zhang, Hanbo, Yu, Cunjun, Xu, Jie, Wu, Hongtao, Cheang, Chilam, Jing, Ya, Zhang, Weinan, Liu, Huaping, Li, Hang, Kong, Tao

arXiv.org Artificial Intelligence

Recent progress in vision-language foundation models has shown their ability to understand multimodal data and resolve complicated vision-language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is only lightly fine-tuned by imitation learning on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo with the flexibility for open-loop control and deployment on low-performance platforms. By exceeding state-of-the-art performance by a large margin on the tested benchmark, we show that RoboFlamingo can be an effective and competitive alternative for adapting VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy.
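A conceptual PyTorch sketch of the decomposition described above: per-step features from a vision-language backbone are aggregated by an explicit recurrent policy head that outputs an action per time step. The LSTM head, feature dimensions, and 7-dimensional action space are illustrative assumptions, not RoboFlamingo's exact policy head.

```python
# Policy-head sketch: sequential history modeling on top of per-step VLM features.

import torch
import torch.nn as nn


class HistoryPolicyHead(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512, action_dim=7):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)  # models sequential history
        self.action = nn.Linear(hidden, action_dim)              # e.g. arm pose + gripper

    def forward(self, step_features: torch.Tensor) -> torch.Tensor:
        hidden_states, _ = self.rnn(step_features)
        return self.action(hidden_states)                         # one action per time step


if __name__ == "__main__":
    head = HistoryPolicyHead()
    vlm_feats = torch.randn(2, 8, 1024)   # [batch, time, dim] single-step VLM features
    print(head(vlm_feats).shape)          # torch.Size([2, 8, 7])
```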