Zhou, Li
Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
Wang, Kuang, Li, Xianfei, Yang, Shenghao, Zhou, Li, Jiang, Feng, Li, Haizhou
User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, existing simulators often rely solely on text utterances, missing implicit user traits such as personality, speaking style, and goals. In contrast, persona-based methods lack generalizability, as they depend on predefined profiles of famous individuals or archetypes. To address these challenges, we propose User Simulator with implicit Profiles (USP), a framework that infers implicit user profiles from human-machine conversations and uses them to generate more personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema. Then, we refine the simulation through conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing it at both the utterance and conversation levels. Finally, we adopt a diverse profile sampler to capture the distribution of real-world user profiles. Experimental results demonstrate that USP outperforms strong baselines in terms of authenticity and diversity while achieving comparable performance in consistency. Furthermore, dynamic multi-turn evaluations based on USP strongly align with mainstream benchmarks, demonstrating its effectiveness in real-world applications.
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Huang, Ailin, Wu, Boyong, Wang, Bruce, Yan, Chao, Hu, Chen, Feng, Chengli, Tian, Fei, Shen, Feiyu, Li, Jingbei, Chen, Mingrui, Liu, Peng, Miao, Ruihang, You, Wang, Chen, Xi, Yang, Xuerui, Huang, Yechang, Zhang, Yuxiang, Gong, Zheng, Zhang, Zixin, Zhou, Hongyu, Sun, Jianjian, Li, Brian, Feng, Chengting, Wan, Changyi, Hu, Hanpeng, Wu, Jianchang, Zhen, Jiangjie, Ming, Ranchen, Yuan, Song, Zhang, Xuelin, Zhou, Yu, Li, Bingxin, Ma, Buyun, Wang, Hongyuan, An, Kang, Ji, Wei, Li, Wen, Wen, Xuan, Kong, Xiangwen, Ma, Yuankai, Liang, Yuanwei, Mou, Yun, Ahmidi, Bahtiyar, Wang, Bin, Li, Bo, Miao, Changxin, Xu, Chen, Wang, Chenrun, Shi, Dapeng, Sun, Deshan, Hu, Dingyuan, Sai, Dula, Liu, Enle, Huang, Guanzhe, Yan, Gulin, Wang, Heng, Jia, Haonan, Zhang, Haoyang, Gong, Jiahao, Guo, Junjing, Liu, Jiashuai, Liu, Jiahong, Feng, Jie, Wu, Jie, Wu, Jiaoren, Yang, Jie, Wang, Jinguo, Zhang, Jingyang, Lin, Junzhe, Li, Kaixiang, Xia, Lei, Zhou, Li, Zhao, Liang, Gu, Longlong, Chen, Mei, Wu, Menglin, Li, Ming, Li, Mingxiao, Li, Mingliang, Liang, Mingyao, Wang, Na, Hao, Nie, Wu, Qiling, Tan, Qinyuan, Sun, Ran, Shuai, Shuai, Pang, Shaoliang, Yang, Shiliang, Gao, Shuli, Yuan, Shanshan, Liu, Siqi, Deng, Shihong, Jiang, Shilei, Liu, Sitong, Cao, Tiancheng, Wang, Tianyu, Deng, Wenjin, Xie, Wuxun, Ming, Weipeng, He, Wenqing, Sun, Wen, Han, Xin, Huang, Xin, Deng, Xiaomin, Liu, Xiaojia, Wu, Xin, Zhao, Xu, Wei, Yanan, Yu, Yanbo, Cao, Yang, Li, Yangguang, Ma, Yangzhen, Xu, Yanming, Wang, Yaoyu, Shi, Yaqiang, Wang, Yilei, Zhou, Yizhuang, Zhong, Yinmin, Zhang, Yang, Wei, Yaoben, Luo, Yu, Lu, Yuanwei, Yin, Yuhe, Luo, Yuchu, Ding, Yuanhao, Yan, Yuting, Dai, Yaqi, Yang, Yuxiang, Xie, Zhe, Ge, Zheng, Sun, Zheng, Huang, Zhewei, Chang, Zhichao, Guan, Zhisheng, Yang, Zidong, Zhang, Zili, Jiao, Binxing, Jiang, Daxin, Shum, Heung-Yeung, Chen, Jiansheng, Li, Jing, Zhou, Shuchang, Zhang, Xiangyu, Zhang, Xinhao, Zhu, Yibo
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Large Language Models Penetration in Scholarly Writing and Peer Review
Zhou, Li, Zhang, Ruijie, Dai, Xunlian, Hershcovich, Daniel, Li, Haizhou
While the widespread use of Large Language Models (LLMs) brings convenience, it also raises concerns about the credibility of academic research and scholarly processes. To better understand these dynamics, we evaluate the penetration of LLMs across academic workflows from multiple perspectives and dimensions, providing compelling evidence of their growing influence. We propose a framework with two components: \texttt{ScholarLens}, a curated dataset of human- and LLM-generated content across scholarly writing and peer review for multi-perspective evaluation, and \texttt{LLMetrica}, a tool for assessing LLM penetration using rule-based metrics and model-based detectors for multi-dimensional evaluation. Our experiments demonstrate the effectiveness of \texttt{LLMetrica}, revealing the increasing role of LLMs in scholarly processes. These findings emphasize the need for transparency, accountability, and ethical practices in LLM usage to maintain academic credibility.
RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
Liu, Wanlong, Chen, Junying, Ji, Ke, Zhou, Li, Chen, Wenyu, Wang, Benyou
Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they only cover limited RAG scenarios. (2) They suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.
AverageLinear: Enhance Long-Term Time series forcasting with simple averaging
Zhao, Gaoxiang, Zhou, Li, Wang, Xiaoqiang
Long-term time series prediction involves forecasting future trends over extended periods based on historical changes. This approach is crucial in various fields such as weather [1], traffic [2], and power [3]. The exceptionally long forecast horizon and the complex correlations between channels pose significant challenges to modeling. Traditional methods often fall short in capturing the sequence and inter-channel relationships. In contrast, deep learning architectures, with their superior fitting capabilities, have emerged as effective tools for addressing long-term time series prediction. Consequently, the primary methodologies in this field have shifted towards deep learning models. The core issue in long time series analysis is extracting dependencies within sequences and correlations across channels, which significantly benefits model performance in multi-channel prediction and robustness. Various methods have been developed to capture this information from time series data. Commonly used techniques include Transformers [4, 5, 6, 7, 8, 9, 10], which apply attention mechanisms to effectively capture correlations both within sequences and across channels; Convolutional Neural Networks (CNN) [11, 12] that use 1D or multidimensional convolutions to capture these dependencies; and structures based on Multilayer Perceptrons[13, 14, 15, 16, 17], such as DLinear, which decompose sequences and apply multiple linear layers to capture sequence correlations.
Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling and Risk Prognosis
Pang, Huadong, Zhou, Li, Dong, Yiping, Chen, Peiyuan, Gu, Dian, Lyu, Tianyi, Zhang, Hansong
In the healthcare sector, the application of deep learning technologies has revolutionized data analysis and disease forecasting. This is particularly evident in the field of diabetes, where the deep analysis of Electronic Health Records (EHR) has unlocked new opportunities for early detection and effective intervention strategies. Our research presents an innovative model that synergizes the capabilities of Bidirectional Long Short-Term Memory Networks-Conditional Random Field (BiLSTM-CRF) with a fusion of XGBoost and Logistic Regression. This model is designed to enhance the accuracy of diabetes risk prediction by conducting an in-depth analysis of electronic medical records data. The first phase of our approach involves employing BiLSTM-CRF to delve into the temporal characteristics and latent patterns present in EHR data. This method effectively uncovers the progression trends of diabetes, which are often hidden in the complex data structures of medical records. The second phase leverages the combined strength of XGBoost and Logistic Regression to classify these extracted features and evaluate associated risks. This dual approach facilitates a more nuanced and precise prediction of diabetes, outperforming traditional models, particularly in handling multifaceted and nonlinear medical datasets. Our research demonstrates a notable advancement in diabetes prediction over traditional methods, showcasing the effectiveness of our combined BiLSTM-CRF, XGBoost, and Logistic Regression model. This study highlights the value of data-driven strategies in clinical decision-making, equipping healthcare professionals with precise tools for early detection and intervention. By enabling personalized treatment and timely care, our approach signifies progress in incorporating advanced analytics in healthcare, potentially improving outcomes for diabetes and other chronic conditions.
$\rho$-NeRF: Leveraging Attenuation Priors in Neural Radiance Field for 3D Computed Tomography Reconstruction
Zhou, Li, Fang, Changsheng, Morovati, Bahareh, Liu, Yongtong, Han, Shuo, Xu, Yongshun, Yu, Hengyong
This paper introduces $\rho$-NeRF, a self-supervised approach that sets a new standard in novel view synthesis (NVS) and computed tomography (CT) reconstruction by modeling a continuous volumetric radiance field enriched with physics-based attenuation priors. The $\rho$-NeRF represents a three-dimensional (3D) volume through a fully-connected neural network that takes a single continuous four-dimensional (4D) coordinate, spatial location $(x, y, z)$ and an initialized attenuation value ($\rho$), and outputs the attenuation coefficient at that position. By querying these 4D coordinates along X-ray paths, the classic forward projection technique is applied to integrate attenuation data across the 3D space. By matching and refining pre-initialized attenuation values derived from traditional reconstruction algorithms like Feldkamp-Davis-Kress algorithm (FDK) or conjugate gradient least squares (CGLS), the enriched schema delivers superior fidelity in both projection synthesis and image recognition.
Optimized CNNs for Rapid 3D Point Cloud Object Recognition
Lyu, Tianyi, Gu, Dian, Chen, Peiyuan, Jiang, Yaoting, Zhang, Zhenhong, Pang, Huadong, Zhou, Li, Dong, Yiping
This study introduces a method for efficiently detecting objects within 3D point clouds using convolutional neural networks (CNNs). Our approach adopts a unique feature-centric voting mechanism to construct convolutional layers that capitalize on the typical sparsity observed in input data. We explore the trade-off between accuracy and speed across diverse network architectures and advocate for integrating an $\mathcal{L}_1$ penalty on filter activations to augment sparsity within intermediate layers. This research pioneers the proposal of sparse convolutional layers combined with $\mathcal{L}_1$ regularization to effectively handle large-scale 3D data processing. Our method's efficacy is demonstrated on the MVTec 3D-AD object detection benchmark. The Vote3Deep models, with just three layers, outperform the previous state-of-the-art in both laser-only approaches and combined laser-vision methods. Additionally, they maintain competitive processing speeds. This underscores our approach's capability to substantially enhance detection performance while ensuring computational efficiency suitable for real-time applications.
Real-time Monitoring and Analysis of Track and Field Athletes Based on Edge Computing and Deep Reinforcement Learning Algorithm
Tang, Xiaowei, Long, Bin, Zhou, Li
As a fundamental sports discipline, track and field not In recent years, real-time monitoring and data analysis only forms the core of major events like the Olympics have become increasingly critical in enhancing athletic and World Championships but also plays a crucial role in performance. Studies have shown that by monitoring physiological promoting public health Jacobsson, Ekberg, Timpka, Haggren indicators (such as heart rate, body temperature, and Råsberg, Sjöberg, Mirkovic and Nilsson (2020); Timpka, blood oxygen saturation) and performance metrics (such as Dahlström, Fagher, Adami, Andersson, Jacobsson, Svedin speed, acceleration, and force) in real-time, it is possible to and Bermon (2022). The wide variety of track and field events, identify problems during training promptly and make targeted including sprints, middle and long-distance running, jumps, adjustments. For example, analyzing heart rate changes under and throws, demand high levels of physical fitness, technical different training intensities can assess endurance levels and skills, and mental strength from athletes Guo (2022); Zhang recovery status, while monitoring gait and acceleration during et al. (2023a). To excel in such competitive environments, running can optimize technical movements and improve athletes require not only innate talent and dedication but efficiency Rana and Mittal (2020a). Many studies have begun also scientific and systematic training methods Zhang et al. exploring the potential of using sensor technology and data (2023b); Yuan et al. (2024).
Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement
Cheng, Zihao, Zhou, Li, Jiang, Feng, Wang, Benyou, Li, Haizhou
The rapid development of large language models (LLMs), like ChatGPT, has resulted in the widespread presence of LLM-generated content on social media platforms, raising concerns about misinformation, data biases, and privacy violations, which can undermine trust in online discourse. While detecting LLM-generated content is crucial for mitigating these risks, current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content. This approach introduces two novel tasks: LLM Role Recognition (LLM-RR), a multi-class classification task that identifies specific roles of LLM in content generation, and LLM Influence Measurement (LLM-IM), a regression task that quantifies the extent of LLM involvement in content creation. To support these tasks, we propose LLMDetect, a benchmark designed to evaluate detectors' performance on these new tasks. LLMDetect includes the Hybrid News Detection Corpus (HNDC) for training detectors, as well as DetectEval, a comprehensive evaluation suite that considers five distinct cross-context variations and multi-intensity variations within the same LLM role. This allows for a thorough assessment of detectors' generalization and robustness across diverse contexts. Our empirical validation of 10 baseline detection methods demonstrates that fine-tuned PLM-based models consistently outperform others on both tasks, while advanced LLMs face challenges in accurately detecting their own generated content. Our experimental results and analysis offer insights for developing more effective detection models for LLM-generated content. This research enhances the understanding of LLM-generated content and establishes a foundation for more nuanced detection methodologies.