AITopics | Li, Siyu

Collaborating Authors

Li, Siyu

HierDAMap: Towards Universal Domain Adaptive BEV Mapping via Hierarchical Perspective Priors

Li, Siyu, Cao, Yihong, Shi, Hao, Zang, Yongsheng, He, Xuan, Yang, Kailun, Li, Zhiyong

arXiv.org Artificial Intelligence, Mar-9-2025

The exploration of Bird's-Eye View (BEV) mapping technology has driven significant innovation in visual perception technology for autonomous driving. BEV mapping models need to be applied to the unlabeled real world, making the study of unsupervised domain adaptation models an essential path. However, research on unsupervised domain adaptation for BEV mapping remains limited and cannot perfectly accommodate all BEV mapping tasks. To address this gap, this paper proposes HierDAMap, a universal and holistic BEV domain adaptation framework with hierarchical perspective priors. Unlike existing research that solely focuses on image-level learning using prior knowledge, this paper explores the guiding role of perspective prior knowledge across three distinct levels: global, sparse, and instance levels. With these priors, HierDAMap consists of three essential components: Semantic-Guided Pseudo Supervision (SGPS), Dynamic-Aware Coherence Learning (DACL), and Cross-Domain Frustum Mixing (CDFM). SGPS constrains the cross-domain consistency of perspective feature distribution through pseudo labels generated by vision foundation models in 2D space. To mitigate feature distribution discrepancies caused by spatial variations, DACL employs uncertainty-aware predicted depth as an intermediary to derive dynamic BEV labels from perspective pseudo-labels, thereby constraining the coarse BEV features derived from corresponding perspective features. CDFM, on the other hand, leverages perspective masks of view frustum to mix multi-view perspective images from both domains, which guides cross-domain view transformation and encoding learning through mixed BEV labels. The proposed method is verified on multiple BEV mapping tasks, such as BEV semantic segmentation, high-definition semantic mapping, and vectorized mapping. The source code will be made publicly available at https://github.com/lynn-yu/HierDAMap.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.06821

Country:

Asia > Singapore (0.16)
Asia > China (0.14)

Genre: Research Report (1.00)

Industry:

Information Technology (0.48)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

MIO: A Foundation Model on Multimodal Tokens

Wang, Zekun, Zhu, King, Xu, Chunpu, Zhou, Wangchunshu, Liu, Jiaheng, Zhang, Yibo, Wang, Jiashuo, Shi, Ning, Li, Siyu, Li, Yizhi, Que, Haoran, Zhang, Zhaoxiang, Zhang, Yuanxing, Zhang, Ge, Xu, Ke, Fu, Jie, Huang, Wenhao

arXiv.org Artificial Intelligence, Jan-13-2025

In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc. Codes and models are available at https://github.com/MIO-Team/MIO. The advent of Large Language Models (LLMs) is commonly considered the dawn of artificial general intelligence (AGI) (OpenAI et al., 2023; Bubeck et al., 2023), given their generalist capabilities such as complex reasoning (Wei et al., 2022), role playing (Wang et al., 2023c), and creative writing (Wang et al., 2024a). These MM-LLMs typically involve an external multimodal encoder, such as EVA-CLIP (Sun et al., 2023b) or CLAP (Elizalde et al., 2022), with an alignment module such as Q-Former (Li et al., 2023b) or MLP (Liu et al., 2023b) for multimodal understanding. 
These modules align non-textual-modality data features into the embedding space of the LLM backbone. Another line of work involves building any-to-any and end-to-end MM-LLMs that can input and output non-textual modality data.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2409.17692

Country:

Asia (0.67)
North America > United States > Oregon > Multnomah County > Portland (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.82)

Industry: Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Proposing and solving olympiad geometry with guided tree search

Zhang, Chi, Song, Jiajun, Li, Siyu, Liang, Yitao, Ma, Yuxi, Wang, Wei, Zhu, Yixin, Zhu, Song-Chun

arXiv.org Artificial Intelligence, Dec-13-2024

Mathematics olympiads are prestigious competitions, with problem proposing and solving highly honored. Building artificial intelligence that proposes and solves olympiad problems presents an unresolved challenge in automated theorem discovery and proving, especially in geometry for its combination of numerical and spatial elements. We introduce TongGeometry, a Euclidean geometry system supporting tree-search-based guided problem proposing and solving. The efficient geometry system establishes the most extensive repository of geometry theorems to date: within the same computational budget as the existing state-of-the-art, TongGeometry discovers 6.7 billion geometry theorems requiring auxiliary constructions, including 4.1 billion exhibiting geometric symmetry. Among them, 10 theorems were proposed to regional mathematical olympiads with 3 of TongGeometry's proposals selected in real competitions, earning spots in a national team qualifying exam or a top civil olympiad in China and the US. Guided by fine-tuned large language models, TongGeometry solved all International Mathematical Olympiad geometry problems in IMO-AG-30, outperforming gold medalists for the first time. It also surpasses the existing state-of-the-art across a broader spectrum of olympiad-level problems. The full capabilities of the system can be utilized on a consumer-grade machine, making the model more accessible and fostering widespread democratization of its use. By analogy, unlike existing systems that merely solve problems like students, TongGeometry acts like a geometry coach, discovering, presenting, and proving theorems.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.10673

Country: North America > United States (0.68)

Genre:

Research Report (0.40)
Personal (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

Group-Control Motion Planning Framework for Microrobot Swarms in a Global Field

Li, Siyu, Shervedani, Afagh Mehri, Žefran, Miloš, Paprotny, Igor

arXiv.org Artificial Intelligence, Jun-19-2024

This paper investigates how group-control can be effectively used for motion planning for microrobot swarms in a global field. We prove that Small-Time Local Controllability (STLC) in robot positions is achievable through group-control, with the minimum number of groups required for STLC being $\log_2(n + 2) + 1$ for $n$ robots. We then discuss the complexity trade-offs between control and motion planning. We show how motion planning can be simplified if appropriate primitives can be achieved through more complex control actions. We identify motion planning problems that balance the number of robot groups and motion primitives with planning complexity. Various instantiations of these motion planning problems are explored, with simulations to demonstrate the effectiveness of group-control.
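
As a rough illustration of the group-count formula quoted in the abstract above, the sketch below evaluates $\log_2(n + 2) + 1$ for a given swarm size. The function name `min_groups` is hypothetical, and rounding the logarithm up when $n + 2$ is not a power of two is an assumption made here; the abstract does not say how non-integer values are handled.

```python
def min_groups(n_robots: int) -> int:
    """Evaluate log2(n + 2) + 1, the group count quoted in the abstract.

    Assumption (not from the abstract): when n + 2 is not a power of two,
    the logarithm is rounded up to the next integer.
    """
    if n_robots < 1:
        raise ValueError("n_robots must be a positive integer")
    # For an integer m >= 1, ceil(log2(m)) equals (m - 1).bit_length().
    # Here m = n_robots + 2, so m - 1 = n_robots + 1.
    ceil_log2 = (n_robots + 1).bit_length()
    return ceil_log2 + 1


if __name__ == "__main__":
    # n = 6 gives log2(8) + 1 = 4 exactly, with no rounding involved.
    for n in (2, 6, 14, 100):
        print(n, min_groups(n))
```

The integer `bit_length` trick is used instead of `math.log2` only to avoid floating-point rounding near exact powers of two.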

artificial intelligence, complexity, robot, (15 more...)

arXiv.org Artificial Intelligence

2406.13829

Country: North America > United States > Illinois (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (1.00)

DTCLMapper: Dual Temporal Consistent Learning for Vectorized HD Map Construction

Li, Siyu, Lin, Jiacheng, Shi, Hao, Zhang, Jiaming, Wang, Song, Yao, You, Li, Zhiyong, Yang, Kailun

arXiv.org Artificial Intelligence, May-8-2024

Temporal information plays a pivotal role in Bird's-Eye-View (BEV) driving scene understanding and can alleviate visual information sparsity. However, indiscriminate temporal fusion introduces feature redundancy when constructing vectorized High-Definition (HD) maps. In this paper, we revisit the temporal fusion of vectorized HD maps, focusing on temporal instance consistency and temporal map consistency learning. To improve the representation of instances in single-frame maps, we introduce a novel method, DTCLMapper. This approach uses a dual-stream temporal consistency learning module that combines instance embedding with geometry maps. In the instance embedding component, our approach integrates temporal Instance Consistency Learning (ICL), ensuring consistency from vector points and instance features aggregated from points. A vectorized points pre-selection module is employed to enhance the regression efficiency of vector points from each instance. The aggregated instance features obtained from the vectorized points pre-selection module are then grounded in contrastive learning to realize temporal consistency, where positive and negative samples are selected based on position and semantic information. The geometry mapping component introduces Map Consistency Learning (MCL) designed with self-supervised learning. The MCL enhances the generalization capability of our consistent learning approach by concentrating on the global location and distribution constraints of the instances. Extensive experiments on well-recognized benchmarks indicate that the proposed DTCLMapper achieves state-of-the-art performance in vectorized mapping tasks, reaching 61.9% and 65.1% mAP scores on the nuScenes and Argoverse datasets, respectively. The source code will be made publicly available at https://github.com/lynn-yu/DTCLMapper.
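
The abstract above mentions contrastive learning over instance features with positive and negative samples, without naming a specific loss. As a generic illustration only, the sketch below uses an InfoNCE-style loss, which is a common choice and an assumption here, not the paper's stated formulation. The function names, the toy 2D feature vectors, and the temperature value are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor feature.

    The loss is small when the anchor is closer (in cosine similarity)
    to the positive than to the negatives, and large otherwise.
    """
    a = normalize(anchor)
    pos = math.exp(dot(a, normalize(positive)) / temperature)
    neg = sum(math.exp(dot(a, normalize(n)) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    # Positive resembles the anchor; negatives point elsewhere.
    print(info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
    # Swapping roles makes the "positive" dissimilar, so the loss grows.
    print(info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]]))
```

In a temporal setting, the anchor and positive would plausibly be features of the same map instance in adjacent frames, but how the paper actually pairs samples by position and semantics is not specified in the abstract.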

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2405.05518

Country:

North America > United States > California (0.28)
Europe > Germany (0.28)

Genre:

Research Report > Promising Solution (0.66)
Research Report > New Finding (0.46)

Industry:

Education (0.68)
Transportation (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model

Zeng, Kang, Shi, Hao, Lin, Jiacheng, Li, Siyu, Cheng, Jintao, Wang, Kaiwei, Li, Zhiyong, Yang, Kailun

arXiv.org Artificial Intelligence, Apr-19-2024

LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment moving objects in point clouds of the current scan using motion information from previous scans. Despite the promising results achieved by previous MOS methods, several key issues, such as the weak coupling of temporal and spatial information, still need further study. In this paper, we propose a novel LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model, termed MambaMOS. Firstly, we develop a novel embedding module, the Time Clue Bootstrapping Embedding (TCBE), to enhance the coupling of temporal and spatial information in point clouds and alleviate the issue of overlooked temporal clues. Secondly, we introduce the Motion-aware State Space Model (MSSM) to endow the model with the capacity to understand the temporal correlations of the same object across different time steps. Specifically, MSSM emphasizes the motion states of the same object at different time steps through two distinct temporal modeling and correlation steps. We utilize an improved state space model to represent these motion differences, significantly modeling the motion states. Finally, extensive experiments on the SemanticKITTI-MOS and KITTI-Road benchmarks demonstrate that the proposed MambaMOS achieves state-of-the-art performance. The source code of this work will be made publicly available at https://github.com/Terminal-K/MambaMOS.

artificial intelligence, information, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2404.12794

Country: Asia (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Text Classification Based on Knowledge Graphs and Improved Attention Mechanism

Li, Siyu, Chen, Lu, Song, Chenwei, Liu, Xinyi

arXiv.org Artificial Intelligence, Jan-26-2024

To resolve the semantic ambiguity in texts, we propose a model that combines a knowledge graph with an improved attention mechanism. An existing knowledge base is utilized to enrich the text with relevant contextual concepts. The model operates at both character and word levels to deepen its understanding by integrating the concepts. We first adopt information gain to select important words. Then an encoder-decoder framework is used to encode the text along with the related concepts. The local attention mechanism adjusts the weight of each concept, reducing the influence of irrelevant or noisy concepts during classification. We improve the calculation formula for attention scores in the local self-attention mechanism, ensuring that words with different frequencies of occurrence in the text receive higher attention scores. Finally, the model employs a Bi-directional Gated Recurrent Unit (Bi-GRU), which is effective in feature extraction from texts for improved classification accuracy. Its performance is demonstrated on datasets such as AGNews, Ohsumed, and TagMyNews, achieving accuracy of 75.1%, 58.7%, and 68.5% respectively, showing its effectiveness in classification tasks.
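
The abstract above says information gain is used to select words. As a minimal sketch of that general idea, the code below computes the textbook information gain of a word with respect to class labels, treating each document as a set of tokens. The function names and the toy corpus are hypothetical, and the paper's exact variant of the measure may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` is present.

    `docs` is a list of token sets, aligned one-to-one with `labels`.
    """
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = (
        (len(with_word) / n) * entropy(with_word)
        + (len(without_word) / n) * entropy(without_word)
    )
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Toy corpus (made up for illustration): two classes, four documents.
    docs = [
        {"goal", "match"},
        {"goal", "team"},
        {"stock", "market"},
        {"stock", "team"},
    ]
    labels = ["sports", "sports", "business", "business"]
    # Rank the vocabulary by information gain, highest first.
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda w: information_gain(docs, labels, w), reverse=True)
    print(ranked)
```

In this toy corpus, "goal" separates the classes perfectly while "team" appears equally in both, so a selector based on this score would keep the former and discard the latter.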

machine learning, natural language, short text, (16 more...)

arXiv.org Artificial Intelligence

2401.03591

Country: Asia > China (0.28)

Genre: Research Report (0.51)

Industry:

Media (0.93)
Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.72)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.67)

Proactive Robot Control for Collaborative Manipulation Using Human Intent

Rysbek, Zhanibek, Li, Siyu, Shervedani, Afagh Mehri, Zefran, Milos

arXiv.org Artificial Intelligence, Nov-5-2023

Collaborative manipulation tasks often require negotiation using explicit or implicit communication. An important example is determining where to move when the goal destination is not uniquely specified, and who should lead the motion. This work is motivated by the ability of humans to communicate the desired destination of motion through back-and-forth force exchanges. Inherent to these exchanges is also the ability to dynamically assign a role to each participant, either taking the initiative or deferring to the partner's lead. In this paper, we propose a hierarchical robot control framework that emulates human behavior in communicating a motion destination to a human collaborator and in responding to their actions. At the top level, the controller consists of a set of finite-state machines corresponding to different levels of commitment of the robot to its desired goal configuration. The control architecture is loosely based on the human strategy observed in the human-human experiments, and the key component is a real-time intent recognizer that helps the robot respond to human actions. We describe the details of the control framework, as well as the feature engineering and training process of the intent recognizer. The proposed controller was implemented on a UR10e robot (Universal Robots) and evaluated through human studies. The experiments show that the robot correctly recognizes and responds to human input, communicates its intent clearly, and resolves conflict. We report success rates and draw comparisons with human-human experiments to demonstrate the effectiveness of the approach.

artificial intelligence, controller, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2311.02809

Country:

North America > United States > Illinois (0.14)
Asia > Japan > Honshū (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Bi-Mapper: Holistic BEV Semantic Mapping for Autonomous Driving

Li, Siyu, Yang, Kailun, Shi, Hao, Zhang, Jiaming, Lin, Jiacheng, Teng, Zhifeng, Li, Zhiyong

arXiv.org Artificial Intelligence, Sep-6-2023

A semantic map of the road scene, covering fundamental road elements, is an essential ingredient in autonomous driving systems. It provides important perception foundations for positioning and planning when rendered in the Bird's-Eye-View (BEV). Currently, the prior knowledge of hypothetical depth can guide the learning of translating front perspective views into BEV directly with the help of calibration parameters. However, it suffers from geometric distortions in the representation of distant objects. In addition, another stream of methods without prior knowledge can learn the transformation between front perspective views and BEV implicitly with a global view. Considering that the fusion of different learning methods may bring surprising beneficial effects, we propose a Bi-Mapper framework for top-down road-scene semantic understanding, which incorporates a global view and local prior knowledge. To enhance reliable interaction between them, an asynchronous mutual learning strategy is proposed. At the same time, an Across-Space Loss (ASL) is designed to mitigate the negative impact of geometric distortions. Extensive results on nuScenes and Cam2BEV datasets verify the consistent effectiveness of each module in the proposed Bi-Mapper framework. Compared with existing road mapping networks, the proposed Bi-Mapper achieves 2 . Moreover, we verify the generalization performance of Bi-Mapper in a real-world driving scenario. The source code is publicly available at BiMapper. In autonomous driving systems, a semantic map is an important basic element, which affects downstream tasks, including localization and planning. Recently, the Bird's-Eye-View (BEV) map has shown an outstanding performance [1].

artificial intelligence, coordinate system, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2305.04205

Country:

Asia > China (0.14)
Europe > Germany (0.14)
Europe > United Kingdom (0.14)

Genre: Research Report (0.64)

Industry:

Transportation > Ground > Road (0.92)
Information Technology > Robotics & Automation (0.81)
Automobiles & Trucks (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.81)

Deadlock-Free Collision Avoidance for Nonholonomic Robots

Zheng, Ruochen, Li, Siyu

arXiv.org Artificial Intelligence, May-8-2023

We present a method for deadlock-free and collision-free navigation in a multi-robot system with nonholonomic robots. The problem is solved by quadratic programming and is applicable to most wheeled mobile robots with linear kinematic constraints. We introduce masked velocity and the Masked Cooperative Collision Avoidance (MCCA) algorithm to encourage fully decentralized deadlock avoidance behavior. To verify the method, we provide a detailed implementation and introduce heading oscillation avoidance for differential-drive robots. To the best of our knowledge, it is the first method to give very promising and stable results for deadlock avoidance even in situations with a large number of robots and narrow passages.

artificial intelligence, constraint, robot, (15 more...)

arXiv.org Artificial Intelligence

2305.04511

Country: North America > United States (0.46)

Genre: Research Report (0.40)

Industry: Transportation (0.74)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.64)