Wang, Xiaoqiang
R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression
Wang, Xiaoqiang, Wang, Suyuchen, Zhu, Yun, Liu, Bang
Memory plays a key role in enhancing LLMs' performance when deployed in real-world applications. Existing solutions face trade-offs: explicit memory designs based on external storage require complex management and incur storage overhead, while implicit memory designs that store information via parameters struggle with reliable retrieval. In this paper, we propose R$^3$Mem, a memory network that optimizes both information Retention and Retrieval through Reversible context compression. Specifically, R$^3$Mem employs virtual memory tokens to compress and encode infinitely long histories, further enhanced by a hierarchical compression strategy that refines information from document- to entity-level for improved assimilation across granularities. For retrieval, R$^3$Mem employs a reversible architecture, reconstructing raw data by invoking the model backward with compressed information. Implemented via parameter-efficient fine-tuning, it can integrate seamlessly with any Transformer-based model. Experiments demonstrate that our memory design achieves state-of-the-art performance in long-context language modeling and retrieval-augmented generation tasks. It also significantly outperforms conventional memory modules in long-horizon interaction tasks like conversational agents, showcasing its potential for next-generation retrieval systems.
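The reversibility principle the abstract relies on can be illustrated with a toy additive coupling map: such a map can be inverted exactly, so information folded into a compressed state going forward can be reconstructed by running the same map backward. This is a minimal sketch of reversible computation in general, not the paper's actual Transformer-based architecture, and `mix` is an arbitrary placeholder function.

```python
def mix(x):
    """Arbitrary mixing function; need not be invertible itself."""
    return [2.0 * v + 1.0 for v in x]

def forward(x1, x2):
    """Additive coupling: y1 passes through, y2 absorbs mix(x1)."""
    y1 = x1
    y2 = [a + b for a, b in zip(x2, mix(x1))]
    return y1, y2

def backward(y1, y2):
    """Exact inverse: subtract the same mix(y1) to recover x2."""
    x1 = y1
    x2 = [a - b for a, b in zip(y2, mix(y1))]
    return x1, x2

x1, x2 = [1.0, 2.0], [3.0, 4.0]
y1, y2 = forward(x1, x2)
assert backward(y1, y2) == (x1, x2)  # lossless round trip
```

Because the inverse is exact, no information is lost in the compressed state, which is the property that makes reconstruction-by-backward-invocation possible.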
AverageLinear: Enhancing Long-Term Time Series Forecasting with Simple Averaging
Zhao, Gaoxiang, Zhou, Li, Wang, Xiaoqiang
Long-term time series prediction involves forecasting future trends over extended periods based on historical changes. Such forecasts are crucial in fields such as weather [1], traffic [2], and power [3]. The exceptionally long forecast horizon and the complex correlations between channels pose significant challenges to modeling. Traditional methods often fall short in capturing intra-sequence and inter-channel relationships. In contrast, deep learning architectures, with their superior fitting capabilities, have emerged as effective tools for long-term time series prediction, and the primary methodologies in this field have consequently shifted towards deep learning models. The core issue in long time series analysis is extracting dependencies within sequences and correlations across channels, which significantly benefits model performance and robustness in multi-channel prediction. Various methods have been developed to capture this information from time series data. Commonly used techniques include Transformers [4, 5, 6, 7, 8, 9, 10], which apply attention mechanisms to effectively capture correlations both within sequences and across channels; Convolutional Neural Networks (CNNs) [11, 12], which use 1D or multidimensional convolutions to capture these dependencies; and structures based on Multilayer Perceptrons [13, 14, 15, 16, 17], such as DLinear, which decompose sequences and apply multiple linear layers to capture sequence correlations.
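The DLinear-style decomposition mentioned above can be sketched in a few lines: split each channel into a moving-average trend and a residual, then model each part separately. The naive persistence forecast below is a hedged toy stand-in for the learned linear layers, not the AverageLinear method itself.

```python
def moving_average(x, k=3):
    """Trailing moving average with edge padding, window size k."""
    padded = [x[0]] * (k - 1) + list(x)
    return [sum(padded[i:i + k]) / k for i in range(len(x))]

def decompose(x, k=3):
    """Split a series into a smooth trend and the remaining residual."""
    trend = moving_average(x, k)
    residual = [a - b for a, b in zip(x, trend)]
    return trend, residual

def persistence_forecast(x, horizon, k=3):
    """Naive baseline: persist the last trend value plus last residual."""
    trend, residual = decompose(x, k)
    return [trend[-1] + residual[-1]] * horizon
```

In DLinear-family models the two components are instead each passed through their own linear layer before being recombined; the decomposition step is the shared ingredient.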
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning
Wang, Xiaoqiang, Liu, Bang
Large language models (LLMs) and large multimodal models (LMMs) have shown great potential in automating complex tasks like web browsing and gaming. However, their ability to generalize across diverse applications remains limited, hindering broader utility. To address this challenge, we present OSCAR: Operating System Control via state-Aware reasoning and Re-planning. OSCAR is a generalist agent designed to autonomously navigate and interact with various desktop and mobile applications through standardized controls, such as mouse and keyboard inputs, while processing screen images to fulfill user commands. To enhance stability and adaptability, OSCAR operates as a state machine, equipped with error-handling mechanisms and task-driven re-planning, allowing it to efficiently adjust to real-time feedback and exceptions. We demonstrate OSCAR's effectiveness through extensive experiments on diverse benchmarks across desktop and mobile platforms, where it transforms complex workflows into simple natural language commands, significantly boosting user productivity. Our code will be open-sourced upon publication.

These model-centric agents show revolutionary potential for automating real-world tasks such as web browsing (Gur et al., 2023), gaming (Krzywinska, 2024), and software development (Hong et al.). However, despite impressive results, these agents struggle to generalize across different applications due to variations in observation and action spaces. In real-world scenarios, workflows often involve switching between applications and interacting with diverse graphical or command-line interfaces. This raises an intriguing and practical question: can we build a generalist agent capable of following user instructions across various applications using standardized operating system (OS) controls like mouse and keyboard inputs, while processing screen outputs?
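The state-machine control loop described above can be sketched as plan, act, verify, and re-plan on errors or failed verification. The state names and the `plan`/`act`/`verify` callables below are hypothetical placeholders for illustration, not OSCAR's actual interface.

```python
PLAN, ACT, VERIFY, REPLAN, DONE = "plan", "act", "verify", "replan", "done"

def run_agent(task, plan, act, verify, max_steps=20):
    """Drive a task to completion as a small state machine."""
    state, steps, result = PLAN, [], None
    for _ in range(max_steps):
        if state == PLAN or state == REPLAN:
            steps = plan(task)            # (re)generate a step list
            state = ACT
        elif state == ACT:
            try:
                result = act(steps.pop(0))
                state = VERIFY
            except Exception:
                state = REPLAN            # runtime exception -> re-plan
        elif state == VERIFY:
            if not verify(result):
                state = REPLAN            # failed check -> re-plan
            elif steps:
                state = ACT               # more steps remain
            else:
                state = DONE
        elif state == DONE:
            return True
    return state == DONE
```

The point of the explicit states is that exceptions and failed verifications route back through re-planning instead of aborting, which is what gives the loop its adaptability to real-time feedback.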
A Mallows-like Criterion for Anomaly Detection with Random Forest Implementation
Zhao, Gaoxiang, Wang, Lu, Wang, Xiaoqiang
The effectiveness of anomaly signal detection can be significantly undermined by the inherent uncertainty of relying on a single specified model. Under the framework of model averaging methods, this paper proposes a novel criterion to select the weights for aggregating multiple models, wherein the focal loss function accounts for the classification of extremely imbalanced data. This strategy is further integrated into the Random Forest algorithm by replacing the conventional voting method. We have evaluated the proposed method on benchmark datasets across various domains, including network intrusion detection. The findings indicate that our proposed method not only surpasses model averaging with typical loss functions but also outstrips common anomaly detection algorithms in terms of accuracy and robustness.
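The aggregation idea can be sketched as follows: weight each ensemble member by its focal loss on held-out data (lower loss, larger weight), replacing one-model-one-vote. The softmax-of-negative-loss weighting below is an illustrative choice for the sketch, not the paper's Mallows-like criterion.

```python
import math

def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Mean binary focal loss; probs are predicted P(y = 1)."""
    total = 0.0
    for p, y in zip(probs, labels):
        pt = p if y == 1 else 1.0 - p        # probability of the true class
        total += -((1.0 - pt) ** gamma) * math.log(max(pt, eps))
    return total / len(labels)

def model_weights(preds_per_model, labels, gamma=2.0):
    """Turn per-model focal losses into normalized aggregation weights."""
    losses = [focal_loss(p, labels, gamma) for p in preds_per_model]
    exps = [math.exp(-loss) for loss in losses]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate(preds_per_model, weights):
    """Weighted average of the models' probability outputs."""
    n = len(preds_per_model[0])
    return [sum(w * p[i] for w, p in zip(weights, preds_per_model))
            for i in range(n)]
```

The focal term $(1-p_t)^\gamma$ down-weights easy examples, so a model's weight is driven mainly by how it handles the rare anomalous class, which is the motivation for using it on imbalanced data.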
Reference Neural Operators: Learning the Smooth Dependence of Solutions of PDEs on Geometric Deformations
Cheng, Ze, Hao, Zhongkai, Wang, Xiaoqiang, Huang, Jianing, Wu, Youjia, Liu, Xudan, Zhao, Yiru, Liu, Songming, Su, Hang
For partial differential equations on domains of arbitrary shapes, existing works of neural operators attempt to learn a mapping from geometries to solutions. It often requires a large dataset of geometry-solution pairs in order to obtain a sufficiently accurate neural operator. However, for many industrial applications, e.g., engineering design optimization, it can be prohibitive to satisfy this requirement, since even a single simulation may take hours or days of computation. To address this issue, we propose reference neural operators (RNO), a novel way of implementing neural operators, i.e., learning the smooth dependence of solutions on geometric deformations. Specifically, given a reference solution, RNO can predict solutions corresponding to arbitrary deformations of the referred geometry. This approach turns out to be much more data efficient. Through extensive experiments, we show that RNO can learn the dependence across various types and different numbers of geometry objects with relatively small datasets. RNO outperforms baseline models in accuracy by a large margin, achieving up to 80% error reduction.
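At the interface level, the idea can be sketched as predicting the solution change induced by a deformation of a reference geometry, rather than mapping each geometry to a solution from scratch. Everything below is a hypothetical stand-in: `correction_model` would be the learned operator, and solutions are flattened to plain lists.

```python
def rno_predict(reference_solution, deformation, correction_model):
    """Predict the deformed geometry's solution as reference + learned delta."""
    delta = correction_model(reference_solution, deformation)
    return [u + d for u, d in zip(reference_solution, delta)]

# With a stub "model" that returns a constant correction:
pred = rno_predict([1.0, 2.0], "small-shift", lambda u, d: [0.1, 0.1])
```

Learning only the (small, smooth) delta rather than the full geometry-to-solution map is what makes the approach data efficient: each reference solution anchors predictions for many nearby deformations.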
FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition
Wang, Xiaoqiang, Liu, Bang, Wu, Lingfei
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks. However, such a paradigm fails to comprehensively differentiate fine-grained language and cognitive skills, leaving LLMs' capabilities insufficiently interpreted. In this paper, we present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation. Specifically, we formulate LLMs' evaluation in a multi-dimensional and explainable manner by dissociating the language-related capabilities from the cognition-related ones. In addition, by extracting the intermediate reasoning from LLMs, we further break down the process of applying a specific capability into three sub-steps: recalling relevant knowledge, utilizing knowledge, and solving problems. Finally, FAC$^2$E evaluates each sub-step of each fine-grained capability, providing a two-faceted diagnosis for LLMs. Utilizing FAC$^2$E, we identify a common shortfall in knowledge utilization among models and propose a straightforward, knowledge-enhanced method to mitigate this issue. Our results not only showcase promising performance enhancements but also highlight a direction for future LLM advancements.
SkillQG: Learning to Generate Question for Reading Comprehension Assessment
Wang, Xiaoqiang, Liu, Bang, Tang, Siliang, Wu, Lingfei
We present $\texttt{SkillQG}$: a question generation framework with controllable comprehension types for assessing and improving machine reading comprehension models. Existing question generation systems widely differentiate questions by $\textit{literal}$ information such as question words and answer types to generate semantically relevant questions for a given context. However, they rarely consider the $\textit{comprehension}$ nature of questions, i.e., the different comprehension capabilities embodied by different questions. In comparison, our $\texttt{SkillQG}$ is able to tailor a fine-grained assessment and improvement to the capabilities of the question answering models built on it. Specifically, we first frame the comprehension type of questions based on a hierarchical skill-based schema, then formulate $\texttt{SkillQG}$ as a skill-conditioned question generator. Furthermore, to improve the controllability of generation, we augment the input text with question focus and skill-specific knowledge, which are constructed by iteratively prompting the pre-trained language models. Empirical results demonstrate that $\texttt{SkillQG}$ outperforms baselines in terms of quality, relevance, and skill-controllability while showing a promising performance boost in downstream question answering tasks.
Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation
Wang, Xiaoqiang, Liu, Yanqing, Li, Jinyu, Zhao, Sheng
We previously proposed contextual spelling correction (CSC) to correct the output of end-to-end (E2E) automatic speech recognition (ASR) models with contextual information such as names, places, etc. Although CSC has achieved reasonable improvement on the biasing problem, two drawbacks still limit further accuracy gains. First, due to the limited information in text-only hypotheses, or the weak performance of the ASR model on rare domains, the CSC model may fail to correct phrases with similar pronunciation, or anti-context cases where no biasing phrase is present in the utterance. Second, there is a discrepancy between the training and inference of CSC: the bias list in training is randomly selected, but in inference there may be more similarity between the ground-truth phrase and other phrases. To address these limitations, in this paper we propose an improved non-autoregressive (NAR) spelling correction model for contextual biasing in E2E neural transducer-based ASR systems, which improves on the previous CSC model from two perspectives. First, we incorporate acoustic information, via an external attention mechanism, together with text hypotheses into CSC to better distinguish the target phrase from dissimilar or irrelevant phrases. Second, we design a semantic-aware data augmentation schema in the training phase to reduce the mismatch between training and inference and further boost the biasing accuracy. Experiments show that the improved method outperforms the baseline ASR+Biasing system by as much as 20.3% relative name recall gain and achieves stable improvement over the previous CSC method across different bias list name coverage ratios.
A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems
Wang, Xiaoqiang, Liu, Yanqing, Zhao, Sheng, Li, Jinyu
It is challenging to customize a transducer-based automatic speech recognition (ASR) system with context information that is dynamic and unavailable during model training. In this work, we introduce a light-weight contextual spelling correction model to correct context-related recognition errors in transducer-based ASR systems. We incorporate the context information into the spelling correction model with a shared context encoder and use a filtering algorithm to handle large-size context lists. Specifically, we propose a novel contextual biasing method which leverages contextual information by adding a contextual spelling correction (CSC) model on top of the transducer model. To consider contextual information during correction, a context encoder which encodes context phrases into hidden embeddings is added to the spelling correction model [16, 17]; the decoder of the correction model then attends to the context encoder and text encoder via an attention mechanism [18].
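The attention step described above can be sketched as a decoder state attending over encoded context phrases, so that likely biasing phrases receive higher weight. The dot-product scoring and the toy embeddings are illustrative assumptions, not the model's real encoders.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, context_embeddings):
    """Dot-product attention of one decoder state over context embeddings."""
    scores = [sum(q * c for q, c in zip(query, emb))
              for emb in context_embeddings]
    weights = softmax(scores)
    pooled = [sum(w * emb[i] for w, emb in zip(weights, context_embeddings))
              for i in range(len(query))]
    return weights, pooled
```

A context phrase whose embedding aligns with the decoder state gets a high weight and dominates the pooled vector, which is how the corrector is steered toward the relevant biasing phrase.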
Large-scale Traffic Signal Control Using a Novel Multi-Agent Reinforcement Learning
Wang, Xiaoqiang, Ke, Liangjun, Qiao, Zhimin, Chai, Xinghua
Finding the optimal signal timing strategy is a difficult task for large-scale traffic signal control (TSC). Multi-Agent Reinforcement Learning (MARL) is a promising method for this problem, but there is still room for improvement in scaling to large problems and in modeling, for each individual agent, the behaviors of the other agents. In this paper, a new MARL algorithm, called Cooperative double Q-learning (Co-DQL), is proposed, which has several prominent features. It uses a highly scalable independent double Q-learning method based on double estimators and the UCB policy, which can eliminate the overestimation problem of traditional independent Q-learning while ensuring exploration. It uses mean-field approximation to model the interaction among agents, thereby helping agents learn a better cooperative strategy. To improve the stability and robustness of the learning process, we introduce a new reward allocation mechanism and a local state sharing method. In addition, we analyze the convergence properties of the proposed algorithm. Co-DQL is applied to TSC and tested on a multi-traffic-signal simulator. According to the results obtained on several traffic scenarios, Co-DQL outperforms several state-of-the-art decentralized MARL algorithms and can effectively shorten the average waiting time of vehicles across the whole road system.
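The base ingredient Co-DQL builds on, the tabular double Q-learning update, can be sketched in a few lines; two estimators decouple action selection from evaluation, which is what curbs overestimation. The UCB exploration, mean-field interaction modeling, and reward allocation mechanism from the paper are not sketched here.

```python
import random

def double_q_update(QA, QB, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Randomly pick one estimator to update; the other one evaluates."""
    Q1, Q2 = (QA, QB) if random.random() < 0.5 else (QB, QA)
    best = max(Q1[s2], key=Q1[s2].get)   # action selected by Q1
    target = r + gamma * Q2[s2][best]    # value estimated by Q2
    Q1[s][a] += alpha * (target - Q1[s][a])
```

Because the maximizing action and its value estimate come from different tables, noise in one estimator no longer inflates its own target, removing the positive bias of the single-estimator max.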