Ren, Xiaozhe
Self-Adjust Softmax
Zheng, Chuanyang, Gao, Yihang, Chen, Guoxuan, Shi, Han, Xiong, Jing, Ren, Xiaozhe, Huang, Chao, Jiang, Xin, Li, Zhenguo, Li, Yu
The softmax function is crucial in Transformer attention: it normalizes each row of the attention scores to sum to one and achieves superior performance over alternative functions. However, the softmax function can suffer from a gradient vanishing issue when some elements of the attention scores approach extreme values, such as probabilities close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$. We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax can be seamlessly integrated into the attention mechanisms of existing Transformer models with minor adjustments. We conducted experiments to evaluate the empirical performance of Transformer models using SA-Softmax compared to the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, were conducted across diverse datasets, language tasks, and positional encoding methods.
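The two variants map directly onto code. Below is a minimal PyTorch sketch of both, assuming row-wise application to query-key attention scores; the epsilon guard and the tensor shapes in the usage example are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def sa_softmax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Plain SA-Softmax: x * softmax(x), applied row-wise to attention scores."""
    return scores * F.softmax(scores, dim=dim)

def sa_softmax_normalized(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Normalized variant: (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x)."""
    x_min = scores.amin(dim=dim, keepdim=True).clamp(max=0.0)   # min(x_min, 0)
    x_max = scores.amax(dim=dim, keepdim=True).clamp(min=0.0)   # max(0, x_max)
    scale = (scores - x_min) / (x_max - x_min + 1e-6)           # eps guards the degenerate all-zero row
    return scale * F.softmax(scores, dim=dim)

# Example: drop-in replacement for softmax over query-key scores.
q, k = torch.randn(2, 8, 16, 64), torch.randn(2, 8, 16, 64)     # (batch, heads, seq, head_dim)
scores = q @ k.transpose(-2, -1) / 64 ** 0.5
weights = sa_softmax_normalized(scores, dim=-1)
```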
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
Chen, Guoxuan, Shi, Han, Li, Jiawei, Gao, Yihang, Ren, Xiaozhe, Chen, Yimeng, Jiang, Xin, Li, Zhenguo, Liu, Weiyang, Huang, Chao
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, owing to the quadratic complexity of attention. In this work, we have identified a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that the information in the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over a 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of 4 million tokens or more while maintaining consistent language modeling capabilities.
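As a rough sketch of the kind of KV retention this description implies (not the paper's actual implementation), the mask below keeps initial tokens, separator tokens, and a recent local window, with segment interiors assumed to be compressed into the separators; the separator ids, the number of initial tokens, and the window size are made-up parameters.

```python
import torch

def sepllm_keep_mask(token_ids: torch.Tensor,
                     separator_ids: set,
                     num_initial: int = 4,
                     recent_window: int = 256) -> torch.Tensor:
    """Boolean mask over a token sequence marking which KV entries to keep:
    initial tokens, separator tokens, and a recent local window. Everything
    else (the segment interiors) is assumed to be condensed into separators."""
    seq_len = token_ids.shape[0]
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[:num_initial] = True                      # attention-sink-style initial tokens
    keep[-recent_window:] = True                   # local window of recent tokens
    for sep in separator_ids:                      # separators such as '.', ',', '\n'
        keep |= token_ids == sep
    return keep

# Example with made-up token ids: suppose 13 encodes '.' and 11 encodes ','.
ids = torch.randint(0, 1000, (2048,))
mask = sepllm_keep_mask(ids, separator_ids={13, 11})
print(f"kept {mask.sum().item()} of {ids.numel()} KV entries")
```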
Scaling Law for Language Models Training Considering Batch Size
Shuai, Xian, Wang, Yiding, Wu, Yimeng, Jiang, Xin, Ren, Xiaozhe
Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law relating model size and the amount of training data. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, providing guidance for optimizing LLM training strategies under specific resource constraints.
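As an illustration of what fitting such a basic scaling law can look like, the sketch below fits an assumed Chinchilla-style form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ to entirely synthetic measurements; the functional form, the data, and the fitted values are assumptions for demonstration only, not results from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(X, E, A, B, alpha, beta):
    """Chinchilla-style parametric form L(N, D) = E + A / N^alpha + B / D^beta."""
    N, D = X
    return E + A / np.power(N, alpha) + B / np.power(D, beta)

# Hypothetical grid of (model size, token count) runs; losses are generated
# from an assumed ground-truth law purely for illustration.
rng = np.random.default_rng(0)
N_grid, D_grid = np.meshgrid([125e6, 760e6, 2.6e9], [50e9, 150e9, 300e9])
N, D = N_grid.ravel(), D_grid.ravel()
L = loss_law((N, D), 1.7, 400.0, 1500.0, 0.34, 0.28) + rng.normal(0, 0.01, N.size)

params, _ = curve_fit(loss_law, (N, D), L,
                      p0=[1.5, 300.0, 1000.0, 0.3, 0.3], maxfev=50000)
E, A, B, alpha, beta = params
print(f"L(N, D) ~ {E:.2f} + {A:.0f}/N^{alpha:.2f} + {B:.0f}/D^{beta:.2f}")
```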
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation
Zheng, Chuanyang, Gao, Yihang, Shi, Han, Xiong, Jing, Sun, Jiankai, Li, Jingyao, Huang, Minbin, Ren, Xiaozhe, Ng, Michael, Jiang, Xin, Li, Zhenguo, Li, Yu
The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feedforward neural networks. In general, the attention scores are determined simply by the key-query products. However, an exploratory experiment in this work (combining DAPE and NoPE), which applies additional MLPs to the attention scores without positional encoding, indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (over neighboring attention scores across different heads) to mimic the processing methods used in computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query-key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem. This insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance. However, the quadratic cost of the key-query multiplication for processing a sequence has raised much concern about the modern Transformer architecture, especially for long-context inputs.
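A minimal sketch of the feature-map view follows: the per-head attention scores are treated as channels of a 2D feature map and refined with a small cross-head convolution before the softmax. The kernel size, the residual connection, and the omission of causal masking are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRefinedAttention(nn.Module):
    """Treat the (heads, seq, seq) attention-score tensor as a multi-channel
    feature map and refine it with a small 2D convolution over neighboring
    scores, mixing information across heads (the channel dimension)."""

    def __init__(self, num_heads: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(num_heads, num_heads,
                              kernel_size=kernel_size,
                              padding=kernel_size // 2)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, heads, seq, head_dim); bidirectional attention for brevity.
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (batch, heads, seq, seq)
        scores = scores + self.conv(scores)                     # residual refinement of the score map
        return F.softmax(scores, dim=-1) @ v

attn = ConvRefinedAttention(num_heads=8)
q, k, v = (torch.randn(2, 8, 32, 64) for _ in range(3))
out = attn(q, k, v)                                             # (2, 8, 32, 64)
```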
CAPE: Context-Adaptive Positional Encoding for Length Extrapolation
Zheng, Chuanyang, Gao, Yihang, Shi, Han, Huang, Minbin, Li, Jingyao, Xiong, Jing, Ren, Xiaozhe, Ng, Michael, Jiang, Xin, Li, Zhenguo, Li, Yu
Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to distinguish token positions in given sequences. However, both APE and RPE remain fixed after model training regardless of the input data, limiting their adaptability and flexibility. Hence, we expect that the desired positional encoding should be context-adaptive and dynamically adjusted based on the given attention. In this paper, we propose a Context-Adaptive Positional Encoding (CAPE) method, which adjusts dynamically and semantically based on the input context and learned fixed priors. Experimental validation on real-world datasets (Arxiv, Books3, and CHE) demonstrates that CAPE enhances model performance both at the training length and in length generalization, with statistically significant improvements. Visualizations suggest that our model retains both local and anti-local information. Finally, we successfully train the model at sequence length 128 and achieve better performance at evaluation sequence length 8192 than other static positional encoding methods, revealing the benefit of adaptive positional encoding.
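The sketch below illustrates the general shape of such a context-adaptive bias, assuming a learned static prior table combined with an MLP adjustment predicted from the attention scores; the actual CAPE parameterization, including how it generalizes beyond the training length, differs from this simplified table-based prior.

```python
import torch
import torch.nn as nn

class ContextAdaptiveBias(nn.Module):
    """Attention bias that combines a learned static positional prior with a
    per-input adjustment predicted from the attention scores themselves."""

    def __init__(self, num_heads: int, max_len: int, hidden: int = 32):
        super().__init__()
        self.static_prior = nn.Parameter(torch.zeros(num_heads, max_len, max_len))
        self.adapter = nn.Sequential(
            nn.Linear(2 * num_heads, hidden), nn.GELU(), nn.Linear(hidden, num_heads))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, seq, seq)
        seq = scores.shape[-1]
        prior = self.static_prior[:, :seq, :seq].expand_as(scores)
        feats = torch.cat([scores, prior], dim=1).permute(0, 2, 3, 1)  # (batch, seq, seq, 2*heads)
        adjust = self.adapter(feats).permute(0, 3, 1, 2)               # (batch, heads, seq, seq)
        return scores + prior + adjust

bias = ContextAdaptiveBias(num_heads=8, max_len=128)
scores = torch.randn(2, 8, 64, 64)
adjusted = bias(scores)
```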
A Survey of Reasoning with Foundation Models
Sun, Jiankai, Zheng, Chuanyang, Xie, Enze, Liu, Zhengying, Chu, Ruihang, Qiu, Jianing, Xu, Jiaqi, Ding, Mingyu, Li, Hongyang, Geng, Mengzhe, Wu, Yue, Wang, Wenhai, Chen, Junsong, Yin, Zhangyue, Ren, Xiaozhe, Fu, Jie, He, Junxian, Yuan, Wu, Liu, Qi, Liu, Xihui, Li, Yu, Dong, Hao, Cheng, Yu, Zhang, Ming, Heng, Pheng Ann, Dai, Jifeng, Luo, Ping, Wang, Jingdong, Wen, Ji-Rong, Qiu, Xipeng, Guo, Yike, Xiong, Hui, Liu, Qun, Li, Zhenguo
Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, e.g., Large Language Models (LLMs), there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into the potential future directions behind the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.
EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge
Yang, Bufang, He, Lixing, Ling, Neiwen, Yan, Zhenyu, Xing, Guoliang, Shuai, Xian, Ren, Xiaozhe, Jiang, Xin
Deep Learning (DL) models have been widely deployed on IoT devices, aided by advances in DL algorithms and chips. However, the limited resources of edge devices make it hard for these on-device DL models to generalize to diverse environments and tasks. Although the recently emerged foundation models (FMs) show impressive generalization power, how to effectively leverage the rich knowledge of FMs on resource-limited edge devices remains largely unexplored. In this paper, we propose EdgeFM, a novel edge-cloud cooperative system with open-set recognition capability. EdgeFM selectively uploads unlabeled data to query the FM on the cloud and customizes the specific knowledge and architectures for edge models. Meanwhile, EdgeFM conducts dynamic model switching at run time, taking into account both data uncertainty and dynamic network variations, which keeps accuracy consistently close to that of the original FM. We implement EdgeFM using two FMs on two edge platforms and evaluate it on three public datasets and two self-collected datasets. Results show that EdgeFM can reduce end-to-end latency by up to 3.2x and achieve a 34.3% accuracy increase compared with the baseline.
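A toy sketch of the run-time routing decision described above is shown below, assuming the edge model exposes a predictive uncertainty and that a bandwidth estimate is available; the thresholds and model interfaces are illustrative assumptions rather than EdgeFM's actual implementation.

```python
import random

def route_inference(sample, edge_model, cloud_fm,
                    uncertainty_threshold: float = 0.35,
                    bandwidth_mbps: float = 10.0,
                    min_bandwidth_mbps: float = 2.0):
    """Decide at run time whether a sample is handled by the small on-device
    model or uploaded to the cloud foundation model, based on the edge model's
    predictive uncertainty and the current network condition."""
    label, uncertainty = edge_model(sample)      # edge model returns a prediction and its uncertainty
    network_ok = bandwidth_mbps >= min_bandwidth_mbps
    if uncertainty > uncertainty_threshold and network_ok:
        return cloud_fm(sample)                  # query the FM for hard / open-set samples
    return label                                 # keep the confident prediction on-device

# Toy stand-ins for the two models.
edge_model = lambda x: ("cat", random.random())
cloud_fm = lambda x: "cat (FM)"
print(route_inference("image.jpg", edge_model, cloud_fm))
```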
CAME: Confidence-guided Adaptive Memory Efficient Optimization
Luo, Yang, Ren, Xiaozhe, Zheng, Zangwei, Jiang, Zhuo, Jiang, Xin, You, Yang
Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which entails substantial extra memory overhead. To solve this problem, several memory-efficient optimizers (e.g., Adafactor) have been proposed to drastically reduce auxiliary memory usage, but with a performance penalty. In this paper, we first study a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers. Based on this strategy, we propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods. Extensive experiments demonstrate the training stability and superior performance of CAME across various NLP tasks such as BERT and GPT-2 training. Notably, for BERT pre-training with a large batch size of 32,768, our proposed optimizer attains faster convergence and higher accuracy compared with the Adam optimizer. The implementation of CAME is publicly available.
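The sketch below illustrates the confidence-guided idea in a simplified, non-factorized form: the adaptive update is rescaled by the inverse square root of an EMA of its deviation from the update EMA, so unstable coordinates take smaller steps. The actual CAME optimizer additionally factorizes the second-moment and confidence statistics to obtain its memory savings, and the hyperparameters here are placeholders.

```python
import torch

def came_like_step(param, grad, state, lr=1e-3,
                   betas=(0.9, 0.999, 0.9999), eps=(1e-30, 1e-16)):
    """Simplified, non-factorized sketch of a confidence-guided update: the
    raw adaptive update u is rescaled by the inverse square root of an EMA of
    its deviation from the update EMA m, down-weighting unstable coordinates."""
    b1, b2, b3 = betas
    v = state.setdefault("v", torch.zeros_like(param))   # second moment of gradients
    m = state.setdefault("m", torch.zeros_like(param))   # EMA of updates
    r = state.setdefault("r", torch.zeros_like(param))   # EMA of update residuals (confidence)

    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    u = grad / (v.sqrt() + eps[0])                       # raw adaptive update
    m.mul_(b1).add_(u, alpha=1 - b1)
    r.mul_(b3).add_((u - m) ** 2, alpha=1 - b3)          # instability of the update
    param.add_(m / (r.sqrt() + eps[1]), alpha=-lr)       # confidence-guided step

# One toy step on a random parameter tensor.
p, state = torch.randn(4, 4), {}
came_like_step(p, torch.randn(4, 4), state)
```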
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
Zheng, Zangwei, Ren, Xiaozhe, Xue, Fuzhao, Luo, Yang, Jiang, Xin, You, Yang
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs. Our approach begins by tapping into the potential of LLMs to accurately perceive and predict the response length with minimal overhead. By leveraging this information, we introduce an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches. We evaluate our approach on real-world instruction datasets using the LLaMA-based model, and our results demonstrate an impressive 86% improvement in inference throughput without compromising effectiveness. Notably, our method is orthogonal to other inference acceleration techniques, making it a valuable addition to many existing toolkits (e.g., FlashAttention, Quantization) for LLM inference.
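A minimal sketch of the sequence-scheduling step is shown below, assuming response-length predictions are already available (e.g., obtained by prompting the LLM to estimate how long its answer will be); the micro-batch size and the toy data are illustrative.

```python
def schedule_by_predicted_length(queries, predicted_lengths, micro_batch_size=4):
    """Group queries whose predicted response lengths are similar into the same
    micro-batch, so short responses are not padded up to the longest response
    in a mixed batch and can finish early."""
    order = sorted(range(len(queries)), key=lambda i: predicted_lengths[i])
    return [[queries[i] for i in order[s:s + micro_batch_size]]
            for s in range(0, len(order), micro_batch_size)]

# Toy example with hypothetical predicted response lengths (in tokens).
queries = [f"q{i}" for i in range(10)]
predicted = [12, 480, 35, 500, 40, 22, 310, 60, 290, 18]
for batch in schedule_by_predicted_length(queries, predicted):
    print(batch)
```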
A Study on Transformer Configuration and Training Objective
Xue, Fuzhao, Chen, Jianghai, Sun, Aixin, Ren, Xiaozhe, Zheng, Zangwei, He, Xiaoxin, Chen, Yongming, Jiang, Xin, You, Yang
Transformer-based models have delivered impressive results on many tasks, particularly vision and language tasks. In many model training situations, conventional configurations are typically adopted: for example, the base model is often configured with a hidden dimension (i.e., model width) of 768 and 12 transformer layers (i.e., model depth). In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, the idea of using deeper and narrower transformer configurations for masked autoencoder training. On ImageNet, with such a simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models like MAE and BEiT. On language tasks, the re-designed model outperforms BERT with the default configuration by 1.1 points on average on the GLUE benchmark.
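As a back-of-the-envelope illustration of the deeper-and-narrower idea (not a configuration reported in the paper), the sketch below compares a rough parameter count of the conventional 12-layer, 768-wide shape with a deeper, narrower shape of similar budget.

```python
def transformer_params(depth: int, width: int, vocab: int = 30522) -> int:
    """Rough parameter count of a BERT-style encoder: token embeddings plus,
    per layer, attention (~4 * width^2) and a 4x-wide feed-forward block
    (~8 * width^2), ignoring biases and layer norms."""
    return vocab * width + depth * 12 * width ** 2

base = transformer_params(depth=12, width=768)          # conventional BERT-base shape
deeper_narrower = transformer_params(depth=48, width=384)  # hypothetical deeper, narrower shape
print(f"base: {base / 1e6:.1f}M params, deeper-narrower: {deeper_narrower / 1e6:.1f}M params")
```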