AITopics | Geng, Xin

Collaborating Authors

Geng, Xin

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Peng, Yingzhe, Zhang, Gongrui, Zhang, Miaosen, You, Zhiyuan, Liu, Jie, Zhu, Qipeng, Yang, Kai, Xu, Xingzhong, Geng, Xin, Yang, Xu

arXiv.org Artificial IntelligenceMar-10-2025

Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose \textbf{LMM-R1}, a two-stage framework adapting rule-based RL for multimodal reasoning through \textbf{Foundational Reasoning Enhancement (FRE)} followed by \textbf{Multimodal Generalization Training (MGT)}. The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83\% and 4.5\% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63\% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2503.07536

Country:

North America > United States > California (0.14)
Asia > China (0.14)

Genre: Research Report > New Finding (0.92)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

Label Distribution Learning with Biased Annotations by Learning Multi-Label Representation

Kou, Zhiqiang, Qin, Si, Wang, Hailin, Xie, Mingkun, Chen, Shuo, Jia, Yuheng, Liu, Tongliang, Sugiyama, Masashi, Geng, Xin

arXiv.org Artificial IntelligenceFeb-3-2025

Multi-label learning (MLL) has gained attention for its ability to represent real-world data. Label Distribution Learning (LDL), an extension of MLL to learning from label distributions, faces challenges in collecting accurate label distributions. To address the issue of biased annotations, based on the low-rank assumption, existing works recover true distributions from biased observations by exploring the label correlations. However, recent evidence shows that the label distribution tends to be full-rank, and naive apply of low-rank approximation on biased observation leads to inaccurate recovery and performance degradation. In this paper, we address the LDL with biased annotations problem from a novel perspective, where we first degenerate the soft label distribution into a hard multi-hot label and then recover the true label information for each instance. This idea stems from an insight that assigning hard multi-hot labels is often easier than assigning a soft label distribution, and it shows stronger immunity to noise disturbances, leading to smaller label bias. Moreover, assuming that the multi-label space for predicting label distributions is low-rank offers a more reasonable approach to capturing label correlations. Theoretical analysis and experiments confirm the effectiveness and robustness of our method on real-world datasets.

artificial intelligence, label distribution, machine learning, (12 more...)

arXiv.org Artificial Intelligence

2502.0117

Country:

Asia (0.46)
Oceania > Australia (0.28)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Speculative Ensemble: Fast Large Language Model Ensemble via Speculation

Fu, Jiale, Jiang, Yuchu, Chen, Junkai, Fan, Jiaming, Geng, Xin, Yang, Xu

arXiv.org Artificial IntelligenceFeb-1-2025

Ensemble methods enhance Large Language Models (LLMs) by combining multiple models but suffer from high computational costs. In this paper, we introduce Speculative Ensemble, a novel framework that accelerates LLM ensembles without sacrificing performance, inspired by Speculative Decoding-where a small proposal model generates tokens sequentially, and a larger target model verifies them in parallel. Our approach builds on two key insights: (1) the verification distribution can be the ensemble distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to ensembles with n models and theoretically prove that SE is never slower than a standard ensemble, typically achieving faster speed. Extensive experiments demonstrate speed improvements of 1.11x-2.23x over standard ensemble techniques without compromising generation quality. Our code is available at https://github.com/Kamichanw/Speculative-Ensemble/

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.01662

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Hawaii (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

STHFL: Spatio-Temporal Heterogeneous Federated Learning

Guo, Shunxin, Wang, Hongsong, Lin, Shuxia, Yang, Xu, Geng, Xin

arXiv.org Artificial IntelligenceJan-10-2025

Federated learning is a new framework that protects data privacy and allows multiple devices to cooperate in training machine learning models. Previous studies have proposed multiple approaches to eliminate the challenges posed by non-iid data and inter-domain heterogeneity issues. However, they ignore the \textbf{spatio-temporal} heterogeneity formed by different data distributions of increasing task data in the intra-domain. Moreover, the global data is generally a long-tailed distribution rather than assuming the global data is balanced in practical applications. To tackle the \textbf{spatio-temporal} dilemma, we propose a novel setting named \textbf{Spatio-Temporal Heterogeneity} Federated Learning (STHFL). Specially, the Global-Local Dynamic Prototype (GLDP) framework is designed for STHFL. In GLDP, the model in each client contains personalized layers which can dynamically adapt to different data distributions. For long-tailed data distribution, global prototypes are served as complementary knowledge for the training on classes with few samples in clients without leaking privacy. As tasks increase in clients, the knowledge of local prototypes generated in previous tasks guides for training in the current task to solve catastrophic forgetting. Meanwhile, the global-local prototypes are updated through the moving average method after training local prototypes in clients. Finally, we evaluate the effectiveness of GLDP, which achieves remarkable results compared to state-of-the-art methods in STHFL scenarios.

artificial intelligence, machine learning, prototype, (14 more...)

arXiv.org Artificial Intelligence

2501.05775

Genre: Research Report (0.70)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

BatStyler: Advancing Multi-category Style Generation for Source-free Domain Generalization

Xu, Xiusheng, Qi, Lei, Zhou, Jingyang, Geng, Xin

arXiv.org Artificial IntelligenceJan-2-2025

Source-Free Domain Generalization (SFDG) aims to develop a model that performs on unseen domains without relying on any source domains. However, the implementation remains constrained due to the unavailability of training data. Research on SFDG focus on knowledge transfer of multi-modal models and style synthesis based on joint space of multiple modalities, thus eliminating the dependency on source domain images. However, existing works primarily work for multi-domain and less-category configuration, but performance on multi-domain and multi-category configuration is relatively poor. In addition, the efficiency of style synthesis also deteriorates in multi-category scenarios. How to efficiently synthesize sufficiently diverse data and apply it to multi-category configuration is a direction with greater practical value. In this paper, we propose a method called BatStyler, which is utilized to improve the capability of style synthesis in multi-category scenarios. BatStyler consists of two modules: Coarse Semantic Generation and Uniform Style Generation modules. The Coarse Semantic Generation module extracts coarse-grained semantics to prevent the compression of space for style diversity learning in multi-category configuration, while the Uniform Style Generation module provides a template of styles that are uniformly distributed in space and implements parallel training. Extensive experiments demonstrate that our method exhibits comparable performance on less-category datasets, while surpassing state-of-the-art methods on multi-category datasets.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2501.01109

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

MageBench: Bridging Large Multimodal Models to Agents

Zhang, Miaosen, Dai, Qi, Yang, Yifan, Bao, Jianmin, Chen, Dongdong, Qiu, Kai, Luo, Chong, Geng, Xin, Guo, Baining

arXiv.org Artificial IntelligenceDec-5-2024

LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in language part, where the chain-of-thought is entirely composed of text.We consider the scenario where visual signals are continuously updated and required along the decision making process. Such vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, while being rarely evaluated. In this paper, we introduce MageBench, a reasoning capability oriented multimodal agent benchmark that, while having light-weight environments, poses significant reasoning challenges and holds substantial practical value. This benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483 different scenarios. It thoroughly validates the agent's knowledge and engineering capabilities, visual intelligence, and interaction skills. The results show that only a few product-level models are better than random acting, and all of them are far inferior to human-level. More specifically, we found current models severely lack the ability to modify their planning based on visual feedback, as well as visual imagination, interleaved image-text long context handling, and other abilities. We hope that our work will provide optimization directions for LMM from the perspective of being an agent. We release our code and data at https://github.com/microsoft/MageBench.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2412.04531

Country: Europe (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment > Sports > Soccer (0.70)
Leisure & Entertainment > Sports > Football (0.46)
Leisure & Entertainment > Games > Computer Games (0.45)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(3 more...)

Add feedback

Redefining in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation

Feng, Fu, Xie, Yucheng, Yang, Xu, Wang, Jing, Geng, Xin

arXiv.org Artificial IntelligenceNov-20-2024

Given the challenge atively generated using . Furthermore, this that diffusion models face in directly generating creativity, meta-creativity enables direct concept combinations without existing methods typically rely on synthesizing reference requiring additional training, much like generating "a prompts or images to achieve creative effects. This significantly reduces both time and computational instance, to combine "Lettuce" and "Mantis" creatively, complexity compared to state-of-the-art (SOTA) ConceptLab [43] merges tokens representing these concepts creative generation methods, such as ConceptLab [43] (4s into a new composite token, while BASS [22] uses predefined vs. 120s per image, 30 speedup) and BASS [22] (4s vs. sampling rules to search for creative outcomes from a 2400s per image, 600 speedup), while maintaining linguistic large pool of candidate images. Further each generation, which leads to high computational costs evaluations using GPT-4o [1] and user studies indicate superior and limited practicality for online applications. In contrast, performance of CreTok in terms of integration, originality, "a blue banana" can be generated directly without additional and aesthetics, underscoring its effectiveness in fostering training, due to its clear and concrete semantics, especially combinatorial creativity. Inspired by this, we may Our contributions are as follows: (1) We propose Cre-ask: Can we awaken the creativity of diffusion models by Tok, a method designed to enhance models' meta-ability enhancing their semantic understanding of "creative"? To by enabling a enhanced understanding of abstract and ambiguous achieve this, we propose CreTok, which redefines "creative" adjectives (e.g., "creative" or "beautiful") through as a new specialized token, , allowing it their redefinition as new tokens. This redefinition we redefine the abstract term "creative" within our proposed enhances the model's semantic understanding for CangJie dataset for the TP2O task, and introduce combinatorial creativity, as shown in Figure 1c. Specifically, text-to-image (T2I) models and creative generation methods CreTok builds on the definition of "creativity" from in terms of computational complexity, human preference the TP2O task [22] for combinatorial object generation, ratings, text-image alignment, and other key metrics. ") and human-like creativity, a critical yet underexplored aspect an adaptive prompt (e.g., "A photo of a mixture"). of AI research [28, 29].

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.2416

Genre:

Research Report > Promising Solution (0.46)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Reduction-based Pseudo-label Generation for Instance-dependent Partial Label Learning

Qiao, Congyu, Xu, Ning, Hu, Yihao, Geng, Xin

arXiv.org Artificial IntelligenceOct-28-2024

Instance-dependent Partial Label Learning (ID-PLL) aims to learn a multi-class predictive model given training instances annotated with candidate labels related to features, among which correct labels are hidden fixed but unknown. The previous works involve leveraging the identification capability of the training model itself to iteratively refine supervision information. However, these methods overlook a critical aspect of ID-PLL: the training model is prone to overfitting on incorrect candidate labels, thereby providing poor supervision information and creating a bottleneck in training. In this paper, we propose to leverage reduction-based pseudo-labels to alleviate the influence of incorrect candidate labels and train our predictive model to overcome this bottleneck. Specifically, reduction-based pseudo-labels are generated by performing weighted aggregation on the outputs of a multi-branch auxiliary model, with each branch trained in a label subspace that excludes certain labels. This approach ensures that each branch explicitly avoids the disturbance of the excluded labels, allowing the pseudo-labels provided for instances troubled by these excluded labels to benefit from the unaffected branches. Theoretically, we demonstrate that reduction-based pseudo-labels exhibit greater consistency with the Bayes optimal classifier compared to pseudo-labels directly generated from the predictive model.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2410.20797

Country:

North America (0.46)
Europe > Austria > Vienna (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.68)
(2 more...)

Add feedback

Towards Better Performance in Incomplete LDL: Addressing Data Imbalance

Kou, Zhiqiang, Xuan, Haoyuan, Wang, Jing, Jia, Yuheng, Geng, Xin

arXiv.org Artificial IntelligenceOct-17-2024

Label Distribution Learning (LDL) is a novel machine learning paradigm that addresses the problem of label ambiguity and has found widespread applications. Obtaining complete label distributions in real-world scenarios is challenging, which has led to the emergence of Incomplete Label Distribution Learning (InLDL). However, the existing InLDL methods overlook a crucial aspect of LDL data: the inherent imbalance in label distributions. To address this limitation, we propose \textbf{Incomplete and Imbalance Label Distribution Learning (I\(^2\)LDL)}, a framework that simultaneously handles incomplete labels and imbalanced label distributions. Our method decomposes the label distribution matrix into a low-rank component for frequent labels and a sparse component for rare labels, effectively capturing the structure of both head and tail labels. We optimize the model using the Alternating Direction Method of Multipliers (ADMM) and derive generalization error bounds via Rademacher complexity, providing strong theoretical guarantees. Extensive experiments on 15 real-world datasets demonstrate the effectiveness and robustness of our proposed framework compared to existing InLDL methods.

artificial intelligence, label distribution, machine learning, (12 more...)

arXiv.org Artificial Intelligence

2410.13579

Country: Asia > China (0.94)

Genre: Research Report (0.50)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science (0.93)

Add feedback

Negative-Prompt-driven Alignment for Generative Language Model

Qiao, Shiqi, Xv, Ning, Liu, Biao, Geng, Xin

arXiv.org Artificial IntelligenceOct-15-2024

Their vast parameters (Kaplan et al., 2020) and extensive training data grant them strong capabilities, but they may still generate outputs that conflict with human values, such as helpless or harmful content. Therefore, AI alignment research has emerged with the goal of fine-tuning LLMs to make them align with human values. One of the most popular alignment methods is RLHF(Reinforcement Learning from Human Feedback) framework (Stiennon et al., 2020; Ziegler et al., 2019; Ouyang et al., 2022), which initially apply supervised fine-tuning to the base model to follow human instructions. Subsequently, a reward model is trained from the human preference data, then optimizing the LLM via PPO algorithm (Schulman et al., 2017) to align with huamn preferences. RLHF requires at least three large models for training, making the process quite complex, and the PPO algorithm itself is highly sophisticated and challenging to parameter-tuning. This drives researchers to explore simpler and more straightforward methods to align language models with human preferences. To simplify alignment, (Rafailov et al., 2023) introduced Direct Preference Optimization (DPO), which provides a closed-form alignment solution and directly uses human preferences for alignment without a separate reward model. Other approaches, like RRHF (Yuan et al., 2023a) and PRO (Song et al., 2024), use SFT-like loss based on multi-ranking datasets to provide richer supervision for alignment.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2410.12194

Country: North America > United States (0.94)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback