AITopics | Cao, Tongtong

Collaborating Authors

Cao, Tongtong

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation

Zhang, Lingfeng, Liu, Yuecheng, Zhang, Zhanguang, Aghaei, Matin, Hu, Yaochen, Gu, Hongjian, Alomrani, Mohammad Ali, Bravo, David Gamaliel Arcos, Karimi, Raika, Hamidizadeh, Atia, Xu, Haoping, Huang, Guowei, Zhang, Zhanpeng, Cao, Tongtong, Qiu, Weichao, Quan, Xingyue, Hao, Jianye, Zhuang, Yuzheng, Zhang, Yingxue

arXiv.org Artificial IntelligenceFeb-19-2025

Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have made them powerful tools in embodied navigation, enabling agents to leverage commonsense and spatial reasoning for efficient exploration in unfamiliar environments. Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation. While this improves efficiency and reduces redundant exploration, the loss of geometric information in language-based representations hinders spatial reasoning, especially in intricate environments. To address this, VLM-based approaches directly process ego-centric visual inputs to select optimal directions for exploration. However, relying solely on a first-person perspective makes navigation a partially observed decision-making problem, leading to suboptimal decisions in complex environments. In this paper, we present a novel vision-language model (VLM)-based navigation framework that addresses these challenges by adaptively retrieving task-relevant cues from a global memory module and integrating them with the agent's egocentric observations. By dynamically aligning global contextual information with local perception, our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks, providing a more effective and scalable solution for embodied navigation.

large language model, natural language, navigation, (18 more...)

arXiv.org Artificial Intelligence

2502.14254

Genre:

Overview (0.88)
Research Report > New Finding (0.48)

Industry: Health & Medicine > Consumer Health (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

HIPPo: Harnessing Image-to-3D Priors for Model-free Zero-shot 6D Pose Estimation

Liu, Yibo, Jiang, Zhaodong, Xu, Binbin, Wu, Guile, Ren, Yuan, Cao, Tongtong, Liu, Bingbing, Yang, Rui Heng, Rasouli, Amir, Shan, Jinjun

arXiv.org Artificial IntelligenceFeb-14-2025

This work focuses on model-free zero-shot 6D object pose estimation for robotics applications. While existing methods can estimate the precise 6D pose of objects, they heavily rely on curated CAD models or reference images, the preparation of which is a time-consuming and labor-intensive process. Moreover, in real-world scenarios, 3D models or reference images may not be available in advance and instant robot reaction is desired. In this work, we propose a novel framework named HIPPo, which eliminates the need for curated CAD models and reference images by harnessing image-to-3D priors from Diffusion Models, enabling model-free zero-shot 6D pose estimation. Specifically, we construct HIPPo Dreamer, a rapid image-to-mesh model built on a multiview Diffusion Model and a 3D reconstruction foundation model. Our HIPPo Dreamer can generate a 3D mesh of any unseen objects from a single glance in just a few seconds. Then, as more observations are acquired, we propose to continuously refine the diffusion prior mesh model by joint optimization of object geometry and appearance. This is achieved by a measurement-guided scheme that gradually replaces the plausible diffusion priors with more reliable online observations. Consequently, HIPPo can instantly estimate and track the 6D pose of a novel object and maintain a complete mesh for immediate robotic applications. Thorough experiments on various benchmarks show that HIPPo outperforms state-of-the-art methods in 6D object pose estimation when prior reference images are limited.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2502.10606

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision > Video Understanding (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)

Add feedback

SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning

Liu, Yuecheng, Chi, Dafeng, Wu, Shiguang, Zhang, Zhanguang, Hu, Yaochen, Zhang, Lingfeng, Zhang, Yingxue, Wu, Shuang, Cao, Tongtong, Huang, Guowei, Huang, Helong, Tian, Guangjian, Qiu, Weichao, Quan, Xingyue, Hao, Jianye, Zhuang, Yuzheng

arXiv.org Artificial IntelligenceJan-22-2025

Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks, largely due to their dependence on language-based outputs. While some approaches have introduced a point-based action space to mitigate this issue, they fall short in managing more intricate tasks within complex environments. This deficiency arises from their failure to fully exploit the inherent thinking and reasoning capabilities that are fundamental strengths of Vision-Language Models (VLMs). To address these limitations, we propose a novel approach named SpatialCoT, specifically designed to bolster the spatial reasoning capabilities of VLMs. Our approach comprises two stages: spatial coordinate bi-directional alignment, which aligns vision-language inputs with spatial coordinates, and chain-of-thought spatial grounding, which harnesses the reasoning capabilities of language models for advanced spatial reasoning. We evaluate SpatialCoT on challenging navigation and manipulation tasks, both in simulation and real-world settings. Experimental results demonstrate that our method significantly outperforms previous state-of-the-art approaches in both tasks.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.10074

Genre:

Research Report > Promising Solution (0.68)
Overview > Innovation (0.54)
Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

UniGaussian: Driving Scene Reconstruction from Multiple Camera Models via Unified Gaussian Representations

Ren, Yuan, Wu, Guile, Li, Runhao, Yang, Zheyuan, Liu, Yibo, Chen, Xingxin, Cao, Tongtong, Liu, Bingbing

arXiv.org Artificial IntelligenceNov-22-2024

Urban scene reconstruction is crucial for real-world autonomous driving simulators. Although existing methods have achieved photorealistic reconstruction, they mostly focus on pinhole cameras and neglect fisheye cameras. In fact, how to effectively simulate fisheye cameras in driving scene remains an unsolved problem. In this work, we propose UniGaussian, a novel approach that learns a unified 3D Gaussian representation from multiple camera models for urban scene reconstruction in autonomous driving. Our contributions are two-fold. First, we propose a new differentiable rendering method that distorts 3D Gaussians using a series of affine transformations tailored to fisheye camera models. This addresses the compatibility issue of 3D Gaussian splatting with fisheye cameras, which is hindered by light ray distortion caused by lenses or mirrors. Besides, our method maintains real-time rendering while ensuring differentiability. Second, built on the differentiable rendering method, we design a new framework that learns a unified Gaussian representation from multiple camera models. By applying affine transformations to adapt different camera models and regularizing the shared Gaussians with supervision from different modalities, our framework learns a unified 3D Gaussian representation with input data from multiple sources and achieves holistic driving scene understanding. As a result, our approach models multiple sensors (pinhole and fisheye cameras) and modalities (depth, semantic, normal and LiDAR point clouds). Our experiments show that our method achieves superior rendering quality and fast rendering speed for driving scene simulation.

artificial intelligence, driving scene reconstruction, unified gaussian representation, (2 more...)

arXiv.org Artificial Intelligence

2411.15355

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback

ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models

Zhang, Lingfeng, Wang, Yuening, Gu, Hongjian, Hamidizadeh, Atia, Zhang, Zhanguang, Liu, Yuecheng, Wang, Yutong, Bravo, David Gamaliel Arcos, Dong, Junyi, Zhou, Shunbo, Cao, Tongtong, Zhuang, Yuzheng, Zhang, Yingxue, Hao, Jianye

arXiv.org Artificial IntelligenceOct-2-2024

Recent advancements in Large Language Models (LLMs) have spurred numerous attempts to apply these technologies to embodied tasks, particularly focusing on high-level task planning and task decomposition. To further explore this area, we introduce a new embodied task planning benchmark, ET-Plan-Bench, which specifically targets embodied task planning using LLMs. It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities, and is designed to evaluate two critical dimensions of LLMs' application in embodied task understanding: spatial (relation constraint, occlusion for target objects) and temporal & causal understanding of the sequence of actions in the environment. By using multi-source simulators as the backend simulator, it can provide immediate environment feedback to LLMs, which enables LLMs to interact dynamically with the environment and re-plan as necessary. We evaluated the state-of-the-art open source and closed source foundation models, including GPT-4, LLAMA and Mistral on our proposed benchmark. While they perform adequately well on simple navigation tasks, their performance can significantly deteriorate when faced with tasks that require a deeper understanding of spatial, temporal, and causal relationships. Thus, our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework that presents a significant challenge to the latest foundation models. We hope it can spark and drive further research in embodied task planning using foundation models.

constraint, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.14682

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction

Khan, Mustafa, Fazlali, Hamidreza, Sharma, Dhruv, Cao, Tongtong, Bai, Dongfeng, Ren, Yuan, Liu, Bingbing

arXiv.org Artificial IntelligenceJul-3-2024

Realistic scene reconstruction and view synthesis are essential for advancing autonomous driving systems by simulating safety-critical scenarios. 3D Gaussian Splatting excels in real-time rendering and static scene reconstructions but struggles with modeling driving scenarios due to complex backgrounds, dynamic objects, and sparse views. We propose AutoSplat, a framework employing Gaussian splatting to achieve highly realistic reconstructions of autonomous driving scenes. By imposing geometric constraints on Gaussians representing the road and sky regions, our method enables multi-view consistent simulation of challenging scenarios including lane changes. Leveraging 3D templates, we introduce a reflected Gaussian consistency constraint to supervise both the visible and unseen side of foreground objects. Moreover, to model the dynamic appearance of foreground objects, we estimate residual spherical harmonics for each foreground Gaussian. Extensive experiments on Pandaset and KITTI demonstrate that AutoSplat outperforms state-of-the-art methods in scene reconstruction and novel view synthesis across diverse driving scenarios. Visit our project page at https://autosplat.github.io/.

artificial intelligence, foreground, gaussian, (11 more...)

arXiv.org Artificial Intelligence

2407.02598

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Japan > Honshū > Chūbu (0.14)

Genre: Research Report > Promising Solution (0.34)

Industry:

Transportation > Ground > Road (0.93)
Automobiles & Trucks (0.93)
Information Technology > Robotics & Automation (0.84)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.84)

Add feedback

Analysis of a Modular Autonomous Driving Architecture: The Top Submission to CARLA Leaderboard 2.0 Challenge

Zhang, Weize, Elmahgiubi, Mohammed, Rezaee, Kasra, Khamidehi, Behzad, Mirkhani, Hamidreza, Arasteh, Fazel, Li, Chunlin, Kaleem, Muhammad Ahsan, Corral-Soto, Eduardo R., Sharma, Dhruv, Cao, Tongtong

arXiv.org Artificial IntelligenceMar-21-2024

In this paper we present the architecture of the Kyber-E2E submission to the map track of CARLA Leaderboard 2.0 Autonomous Driving (AD) challenge 2023, which achieved first place. We employed a modular architecture for our solution consists of five main components: sensing, localization, perception, tracking/prediction, and planning/control. Our solution leverages state-of-the-art language-assisted perception models to help our planner perform more reliably in highly challenging traffic scenarios. We use open-source driving datasets in conjunction with Inverse Reinforcement Learning (IRL) to enhance the performance of our motion planner. We provide insight into our design choices and trade-offs made to achieve this solution. We also explore the impact of each component in the overall performance of our solution, with the intent of providing a guideline where allocation of resources can have the greatest impact.

artificial intelligence, machine learning, module, (17 more...)

arXiv.org Artificial Intelligence

2405.01394

Genre: Research Report (0.82)

Industry: Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.55)

Add feedback