Chen, Li
Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation
Chen, Jianfa, Shen, Emily, Bavalatti, Trupti, Lin, Xiaowen, Wang, Yongkai, Hu, Shuming, Subramanyam, Harihar, Vepuri, Ksheeraj Sai, Jiang, Ming, Qi, Ji, Chen, Li, Jiang, Nan, Jain, Ankit
Recent advances in Generative AI technology have enabled new generations of product applications, such as text generation OpenAI (2023); Anthropic (2023); Dubey et al. (2024), text-to-image generation Ramesh et al. (2021); Dai et al. (2023); Rombach et al. (2022), and text-to-video generation Meta (2024). Consequently, the pace of model development must be matched by the development of safety systems that are properly equipped to mitigate novel harms, ensuring the system's overall integrity and preventing Generative AI products from being exploited by bad actors to disseminate misinformation, glorify violence, and proliferate sexual content Foundation (2023). To achieve this goal, traditional model fine-tuning approaches are often employed, with classifiers that learn patterns from labeled content moderation text data serving as guardrails OpenAI (2023). However, there are many challenges associated with automating content moderation through fine-tuning. First, content moderation is a highly subjective task, meaning that inter-annotator agreement in labeled data is low due to differing interpretations of policy guidelines, especially on borderline cases Markov et al. (2023). Second, it is impossible to enforce a universal taxonomy of harm, not only because of the subjectivity of the task, but also because systems scale to new locales, new audiences, and new use cases, each with different guidelines and different gradients of harm defined on those guidelines Shen et al. (2024). Third, the fine-tuning development cycle, which encompasses data collection, annotation, and model experimentation, is ill-suited to the content moderation domain, where mitigations must land as quickly as possible once vulnerabilities are established. To address these challenges of subjectivity and of inflexibility at scale, we propose a classification approach to content moderation that employs Retrieval-Augmented Generation (Class-RAG) to add retrieved context that elicits reasoning for content classification. While RAG Lewis et al. (2020) is often used for knowledge-intensive tasks where factual citation is key, we find that a RAG-based solution offers a distinct value proposition for the classification task of content moderation, not only because it enhances accuracy through few-shot learning, but because it enables real-time knowledge updates, which is critical in our domain for deploying mitigations quickly once new vulnerabilities are identified.
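To make the retrieval-augmented classification idea concrete, here is a minimal sketch: a query is embedded, the nearest labeled examples are retrieved from an editable reference library, and a few-shot classification prompt is assembled for an LLM. The embedding function, library contents, label set, and prompt format are illustrative assumptions, not the paper's implementation; editing the library is what allows policy updates without retraining.

```python
# Hypothetical sketch of retrieval-augmented classification for content moderation.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a trained text encoder would be used in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

# Editable reference library of labeled examples (the retrieval index).
library = [
    ("how do I report a violent threat I received?", "benign"),
    ("step-by-step instructions to hurt someone", "violating"),
]
keys = np.stack([embed(text) for text, _ in library])

def build_prompt(query: str, k: int = 2) -> str:
    q = embed(query)
    sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q))
    shots = [library[i] for i in np.argsort(-sims)[:k]]  # nearest labeled examples
    examples = "\n".join(f"Text: {t}\nLabel: {y}" for t, y in shots)
    return ("Classify the text as 'benign' or 'violating', "
            "following the retrieved examples.\n"
            f"{examples}\nText: {query}\nLabel:")

print(build_prompt("describe how to make someone afraid"))  # prompt passed to the LLM
```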
Regional Weather Variable Predictions by Machine Learning with Near-Surface Observational and Atmospheric Numerical Data
Zhang, Yihe, Turney, Bryce, Sigdel, Purushottam, Yuan, Xu, Rappin, Eric, Lago, Adrian, Kimball, Sytske, Chen, Li, Darby, Paul, Peng, Lu, Aygun, Sercan, Tu, Yazhou, Najafi, M. Hassan, Tzeng, Nian-Feng
Accurate and timely regional weather prediction is vital for sectors dependent on weather-related decisions. Traditional prediction methods, based on atmospheric equations, often struggle with coarse temporal resolutions and inaccuracies. This paper presents a novel machine learning (ML) model, called MiMa (short for Micro-Macro), that integrates both near-surface observational data from Kentucky Mesonet stations (collected every five minutes, known as Micro data) and hourly atmospheric numerical outputs (termed Macro data) for fine-resolution weather forecasting. The MiMa model employs an encoder-decoder transformer structure, with two encoders for processing multivariate data from both datasets and a decoder for forecasting weather variables over short time horizons. Each instance of the MiMa model, called a modelet, predicts the values of a specific weather parameter at an individual Mesonet station. The approach is extended with Re-MiMa modelets, which are designed to predict weather variables at ungauged locations by training on multivariate data from a few representative stations in a region, tagged with their elevations. Re-MiMa (short for Regional-MiMa) can provide highly accurate predictions across an entire region, even in areas without observational stations. Experimental results show that MiMa significantly outperforms current models, with Re-MiMa offering precise short-term forecasts for ungauged locations, marking a significant advancement in weather forecasting accuracy and applicability.
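As a rough illustration of the dual-encoder modelet described above, the following PyTorch sketch encodes a Micro sequence and a Macro sequence separately, concatenates the two memories, and decodes a short forecast horizon for one target variable. The layer sizes, learned-query decoding, and concatenation-based fusion are assumptions for illustration, not the published architecture.

```python
# Hypothetical sketch of a two-encoder/one-decoder forecasting "modelet".
import torch
import torch.nn as nn

class Modelet(nn.Module):
    def __init__(self, micro_dim, macro_dim, d_model=64, horizon=12):
        super().__init__()
        self.micro_proj = nn.Linear(micro_dim, d_model)
        self.macro_proj = nn.Linear(macro_dim, d_model)
        self.micro_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.macro_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.queries = nn.Parameter(torch.zeros(horizon, d_model))  # learned forecast queries
        self.head = nn.Linear(d_model, 1)  # one target weather variable per modelet

    def forward(self, micro, macro):
        # micro: (batch, micro_steps, micro_dim); macro: (batch, macro_steps, macro_dim)
        memory = torch.cat([self.micro_enc(self.micro_proj(micro)),
                            self.macro_enc(self.macro_proj(macro))], dim=1)
        tgt = self.queries.unsqueeze(0).expand(micro.size(0), -1, -1)
        return self.head(self.decoder(tgt, memory)).squeeze(-1)  # (batch, horizon)

model = Modelet(micro_dim=8, macro_dim=20)
print(model(torch.randn(2, 36, 8), torch.randn(2, 6, 20)).shape)  # torch.Size([2, 12])
```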
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild
Zhao, Xinyu, Sun, Guoheng, Cai, Ruisi, Zhou, Yukun, Li, Pingzhi, Wang, Peihao, Tan, Bowen, He, Yexiao, Chen, Li, Liang, Yi, Chen, Beidi, Yuan, Binhang, Wang, Hongyi, Li, Ang, Wang, Zhangyang, Chen, Tianlong
As Large Language Models (LLMs) excel across tasks and specialized domains, scaling LLMs by building on existing models has garnered significant attention, but it faces the challenge of degraded performance when disparate models are combined. Various techniques have been proposed for the aggregation of pre-trained LLMs, including model merging, Mixture-of-Experts, and stacking. Despite their merits, a comprehensive comparison and synergistic application of these techniques across a diverse model zoo has yet to be adequately addressed. In light of this research gap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First, our work benchmarks existing LLM scaling techniques, especially selective merging and variants of mixture. Using the insights from the benchmark results, we formulate an optimal strategy for selecting and aggregating a heterogeneous model zoo characterized by different architectures and initializations. Our methodology involves clustering mergeable models, selecting an optimal merging strategy, and integrating the clusters through a model mixture. Finally, as evidenced by our experiments on a diverse Llama-2-based model zoo, Model-GLUE shows an average performance enhancement of 5.61%, achieved without additional training. Code is available at: https://github.com/Model-GLUE/Model-GLUE.
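The clustering-then-merging step can be pictured with the toy sketch below: checkpoints whose flattened weights have high cosine similarity are grouped, and each group is merged by uniform weight averaging. The similarity threshold and plain averaging are simplifying assumptions; Model-GLUE's actual selective merging and mixture integration are more involved.

```python
# Hypothetical sketch: cluster mergeable checkpoints by weight similarity, then average.
import numpy as np

def flatten(state):  # state: dict of parameter name -> np.ndarray
    return np.concatenate([w.ravel() for w in state.values()])

def cluster_and_merge(states, threshold=0.95):
    vecs = [flatten(s) for s in states]
    clusters = []  # each cluster is a list of model indices
    for i, v in enumerate(vecs):
        for c in clusters:
            ref = vecs[c[0]]
            cos = v @ ref / (np.linalg.norm(v) * np.linalg.norm(ref))
            if cos >= threshold:   # similar enough to be considered mergeable
                c.append(i)
                break
        else:
            clusters.append([i])
    merged = [{k: np.mean([states[i][k] for i in c], axis=0) for k in states[c[0]]}
              for c in clusters]   # uniform averaging within each cluster
    return clusters, merged

states = [{"w": np.array([1.0, 0.0])},   # model A
          {"w": np.array([0.9, 0.1])},   # fine-tuned variant of A
          {"w": np.array([0.0, 1.0])}]   # structurally different model
print(cluster_and_merge(states)[0])      # [[0, 1], [2]]
```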
Integrating Secondary Structures Information into Triangular Spatial Relationships (TSR) for Advanced Protein Classification
Khajouie, Poorya, Sarkar, Titli, Rauniyar, Krishna, Chen, Li, Xu, Wu, Raghavan, Vijay
Protein structures are key to deciphering biological functions. Finer-grained similarities among proteins are sometimes overlooked by conventional structural comparison methods. More advanced methods, such as the Triangular Spatial Relationship (TSR), have been shown to make finer differentiations. Still, the classical implementation of TSR does not incorporate secondary structure information, which is important for a more detailed understanding of a protein's folding pattern. To overcome this limitation, we developed the SSE-TSR approach. The proposed method integrates secondary structure elements (SSEs) into TSR-based protein representations. This allows an enriched representation of protein structures by considering 18 different combinations of helix, strand, and coil arrangements. Our results show that using SSEs improves the accuracy and reliability of protein classification to varying degrees. We worked with two large protein datasets of 9.2K and 7.8K samples, respectively. We applied the SSE-TSR approach and used a neural network model for classification. Introducing SSEs improved performance on Dataset 1, raising accuracy from 96.0% to 98.3%. For Dataset 2, where performance was already strong, the introduction of SSEs yielded a further small improvement, with accuracy rising from 99.4% to 99.5%. These results show that SSE integration can dramatically improve TSR key discrimination, with significant benefits in datasets with lower initial accuracies and only incremental gains in those with high baseline performance. Thus, SSE-TSR is a powerful bioinformatics tool that improves protein classification and advances the understanding of protein function and interaction.
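The following sketch illustrates the core idea of augmenting triangle-based keys with secondary structure information: every triplet of C-alpha atoms yields a geometric key plus the helix/strand/coil combination of its residues. The geometric key here is a crude side-length binning, a stand-in for the actual TSR key computation, and the toy coordinates and labels are invented for illustration.

```python
# Hypothetical sketch of SSE-augmented triangle keys over C-alpha coordinates.
from itertools import combinations
import numpy as np

def sse_combo(labels):
    """Order-independent helix/strand/coil combination, e.g. ('C', 'H', 'H')."""
    return tuple(sorted(labels))

def triangle_key(p1, p2, p3, n_bins=10, max_len=30.0):
    """Bin the sorted side lengths of the triangle (simplified geometric key)."""
    sides = sorted(np.linalg.norm(a - b) for a, b in combinations((p1, p2, p3), 2))
    return tuple(min(int(s / max_len * n_bins), n_bins - 1) for s in sides)

# Toy protein: four C-alpha positions and their per-residue SSE labels.
coords = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [3.8, 5.0, 0.0]])
sses = ["H", "H", "C", "E"]

counts = {}
for i, j, k in combinations(range(len(coords)), 3):
    key = (triangle_key(coords[i], coords[j], coords[k]),
           sse_combo((sses[i], sses[j], sses[k])))
    counts[key] = counts.get(key, 0) + 1
print(counts)  # key counts form the feature vector fed to the classifier
```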
Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation
Bu, Qingwen, Li, Hongyang, Chen, Li, Cai, Jisong, Zeng, Jia, Cui, Heming, Yao, Maoqing, Qiao, Yu
The increasing demand for versatile robotic systems that operate in diverse and dynamic environments has emphasized the importance of a generalist policy, which leverages a large cross-embodiment data corpus to facilitate broad adaptability and high-level reasoning. However, the generalist struggles with inefficient inference and costly training. The specialist policy, in contrast, is curated for specific domain data and excels at task-level precision and efficiency, yet it lacks the generalization capacity needed for a wide range of applications. Inspired by these observations, we introduce RoboDual, a synergistic dual-system that combines the merits of both generalist and specialist policies. A diffusion transformer-based specialist is devised for multi-step action rollouts, conditioned on the high-level task understanding and discretized action output of a vision-language-action (VLA) based generalist. Compared to OpenVLA, RoboDual achieves a 26.7% improvement in real-world settings and a 12% gain on CALVIN by introducing a specialist policy with merely 20M trainable parameters. It maintains strong performance with only 5% of the demonstration data and enables a 3.8x higher control frequency in real-world deployment. Code will be made publicly available. Our project page is hosted at: https://opendrivelab.com/RoboDual/
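The dual-system control loop can be summarized with the toy sketch below: a slow generalist produces high-level conditioning at a low rate, and a fast specialist refines it into actions at every control step. Both modules are placeholders (the real generalist is a VLA model and the specialist a diffusion transformer), and the 1:5 rate ratio and 7-dimensional action are assumptions.

```python
# Hypothetical sketch of a generalist/specialist dual-system control loop.
import numpy as np

class Generalist:
    """Slow, high-level policy: returns a task latent and a coarse discretized action."""
    def __call__(self, observation):
        return {"latent": np.tanh(observation.mean()) * np.ones(8),
                "coarse_action": np.sign(observation[:7])}

class Specialist:
    """Fast, low-level policy: refines the coarse action using the latest observation."""
    def __call__(self, observation, conditioning):
        return 0.8 * conditioning["coarse_action"] + 0.2 * observation[:7]

generalist, specialist = Generalist(), Specialist()
conditioning = None
for step in range(10):
    obs = np.random.randn(32)
    if step % 5 == 0:                        # generalist runs at a low frequency
        conditioning = generalist(obs)
    action = specialist(obs, conditioning)   # specialist runs at every control step
    print(step, np.round(action, 2))
```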
Integrating Planning into Single-Turn Long-Form Text Generation
Liang, Yi, Wu, You, Zhuang, Honglei, Chen, Li, Shen, Jiaming, Jia, Yiling, Qin, Zhen, Sanghai, Sumit, Wang, Xuanhui, Yang, Carl, Bendersky, Michael
Generating high-quality, in-depth textual documents, such as academic papers, news articles, Wikipedia entries, and books, remains a significant challenge for Large Language Models (LLMs). In this paper, we propose to use planning to generate long-form content. To achieve our goal, we generate intermediate steps via an auxiliary task that teaches the LLM to plan, reason, and structure before generating the final text. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. To overcome the scarcity of training data for these intermediate steps, we leverage LLMs to generate synthetic intermediate writing data, such as outlines, key information, and summaries, from existing full articles. Our experiments on datasets from two domains, the scientific news dataset SciNews and the Wikipedia datasets KILT-Wiki and FreshWiki, demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents. We observed a +2.5% improvement in ROUGE-Lsum and a strong 3.60 overall win/loss ratio in human side-by-side (SxS) evaluation, with clear wins in organization, relevance, and verifiability.
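A minimal sketch of how such a single-turn "plan then write" training example could be assembled is shown below. The outline extraction is a trivial placeholder for the LLM-generated synthetic plans described in the abstract, and the tag names and prompt wording are assumptions.

```python
# Hypothetical sketch: build one training example whose target contains the plan
# followed by the full article, so the model learns to plan and write in a single turn.
def make_training_example(title: str, article: str, n_points: int = 3) -> dict:
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    outline = "\n".join(f"- {s}" for s in sentences[:n_points])  # stand-in synthetic plan
    prompt = f"Write an in-depth article titled: {title}"
    target = f"<plan>\n{outline}\n</plan>\n<article>\n{article}\n</article>"
    return {"prompt": prompt, "target": target}

example = make_training_example(
    "Retrieval in modern NLP",
    "Retrieval augments generation. It adds context. It reduces hallucination.",
)
print(example["target"])
```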
Reasoning Multi-Agent Behavioral Topology for Interactive Autonomous Driving
Liu, Haochen, Chen, Li, Qiao, Yu, Lv, Chen, Li, Hongyang
Autonomous driving systems aim for safe and socially consistent driving through behavioral integration among interactive agents. However, challenges remain due to multi-agent scene uncertainty and heterogeneous interaction. Current dense and sparse behavioral representations struggle with inefficiency and inconsistency in multi-agent modeling, leading to instability of collective behavioral patterns when integrating prediction and planning (IPP). To address this, we propose a topological formation that serves as a compliant behavioral foreground to guide downstream trajectory generation. Specifically, we introduce Behavioral Topology (BeTop), a pivotal topological formulation that explicitly represents the consensual behavioral pattern among multi-agent futures. BeTop is derived from braid theory to distill compliant interactive topology from multi-agent future trajectories. A synergistic learning framework (BeTopNet) supervised by BeTop facilitates the consistency of behavior prediction and planning within the predicted topology priors. Through imitative contingency learning, BeTop also effectively manages behavioral uncertainty for prediction and planning. Extensive verification on large-scale real-world datasets, including nuPlan and WOMD, demonstrates that BeTop achieves state-of-the-art performance in both prediction and planning tasks. Further validation on the proposed interactive scenario benchmark showcases planning compliance in interactive cases.
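As a loose illustration of extracting a behavioral topology from multi-agent futures, the sketch below marks two agents as interacting when their predicted paths come within a distance threshold. This pairwise proximity test is a drastic simplification of the braid-theory formulation; the threshold and toy trajectories are assumptions.

```python
# Hypothetical sketch: derive a binary interaction topology from predicted trajectories.
import numpy as np

def behavioral_topology(trajs: np.ndarray, threshold: float = 2.0) -> np.ndarray:
    """trajs: (num_agents, num_steps, 2) future positions -> symmetric 0/1 matrix."""
    n = trajs.shape[0]
    topo = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            dists = np.linalg.norm(trajs[i] - trajs[j], axis=-1)   # per-step separation
            topo[i, j] = topo[j, i] = int(dists.min() < threshold)
    return topo

trajs = np.stack([
    np.stack([np.linspace(0, 10, 20), np.zeros(20)], axis=-1),        # agent 0 heading east
    np.stack([np.full(20, 5.0), np.linspace(-10, 10, 20)], axis=-1),  # agent 1 crossing its path
])
print(behavioral_topology(trajs))  # 1 marks a pair whose prediction/planning should be coupled
```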
Imagine yourself: Tuning-Free Personalized Image Generation
He, Zecheng, Sun, Bo, Juefei-Xu, Felix, Ma, Haoyu, Ramchandani, Ankit, Cheung, Vincent, Shah, Siddharth, Kalia, Anmol, Subramanyam, Harihar, Zareian, Alireza, Chen, Li, Jain, Ankit, Zhang, Ning, Zhang, Peizhao, Sumbaly, Roshan, Vajda, Peter, Sinha, Animesh
Diffusion models have demonstrated remarkable efficacy across various image-to-image tasks. In this research, we introduce Imagine yourself, a state-of-the-art model designed for personalized image generation. Unlike conventional tuning-based personalization techniques, Imagine yourself operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments. Moreover, previous work struggled to balance identity preservation, adherence to complex prompts, and visual quality, resulting in models with a strong copy-paste effect on the reference images. Such models can hardly generate images following prompts that require significant changes to the reference image, e.g., changing facial expression or head and body poses, and the diversity of the generated images is low. To address these limitations, our proposed method introduces 1) a new synthetic paired data generation mechanism to encourage image diversity, 2) a fully parallel attention architecture with three text encoders and a fully trainable vision encoder to improve text faithfulness, and 3) a novel coarse-to-fine multi-stage finetuning methodology that gradually pushes the boundary of visual quality. Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment. This model establishes a robust foundation for various personalization applications. Human evaluation results validate the model's SOTA superiority across all aspects (identity preservation, text faithfulness, and visual appeal) compared to previous personalization models.
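The "fully parallel attention" ingredient can be pictured with the sketch below: image latents attend separately to each of three text-encoder outputs and to a vision-encoder output, and the branch outputs are summed. The dimensions, sum-based fusion, and module choices are assumptions made for illustration, not the production architecture.

```python
# Hypothetical sketch of parallel attention over three text encoders and a vision encoder.
import torch
import torch.nn as nn

class ParallelTextVisionAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_text_encoders=3):
        super().__init__()
        self.text_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_text_encoders))
        self.vision_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, latents, text_feats, vision_feats):
        out = latents
        for attn, feats in zip(self.text_attn, text_feats):     # parallel text branches
            out = out + attn(latents, feats, feats, need_weights=False)[0]
        out = out + self.vision_attn(latents, vision_feats, vision_feats,
                                     need_weights=False)[0]     # identity-reference branch
        return out

block = ParallelTextVisionAttention()
latents = torch.randn(2, 16, 64)                        # image latent tokens
text_feats = [torch.randn(2, 8, 64) for _ in range(3)]  # outputs of the three text encoders
vision_feats = torch.randn(2, 4, 64)                    # output of the trainable vision encoder
print(block(latents, text_feats, vision_feats).shape)   # torch.Size([2, 16, 64])
```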
Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation
Bai, Yu, Miao, Yukai, Chen, Li, Li, Dan, Ren, Yanyu, Xie, Hongtao, Yang, Ce, Cai, Xuhui
In Greek mythology, Pistis symbolized good faith, trust, and reliability. Drawing inspiration from these principles, we present Pistis-RAG, a scalable multi-stage framework designed to address the challenges of large-scale retrieval-augmented generation (RAG) systems. The framework consists of distinct stages: matching, pre-ranking, ranking, reasoning, and aggregating. Each stage contributes to narrowing the search space, prioritizing semantically relevant documents, aligning with the large language model's (LLM) preferences, supporting complex chain-of-thought (CoT) methods, and combining information from multiple sources. Our ranking stage introduces a significant innovation by recognizing that semantic relevance alone may not lead to improved generation quality, due to sensitivity to few-shot prompt order, as noted in previous research. This critical aspect is often overlooked in current RAG frameworks. We argue that the alignment issue between LLMs and external knowledge ranking methods is tied to the model-centric paradigm dominant in RAG systems. We propose a content-centric approach, emphasizing seamless integration between LLMs and external information sources to optimize content transformation for specific tasks. Our novel ranking stage is designed specifically for RAG systems, incorporating principles of information retrieval while considering the unique business scenarios reflected in LLM preferences and user feedback. Simulating feedback signals on the MMLU benchmark yields a 9.3% performance improvement. Our model and code will be open-sourced on GitHub. Additionally, experiments on real-world, large-scale data validate the scalability of our framework.
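A toy version of the five-stage cascade is sketched below, with each stage reduced to a trivial placeholder (keyword matching, length-based pre-ranking, keyword-overlap ranking, canned reasoning, and string aggregation). The real framework's LLM-preference-aligned ranking and chain-of-thought reasoning are, of course, far richer; everything here is an illustrative assumption.

```python
# Hypothetical sketch of a matching -> pre-ranking -> ranking -> reasoning -> aggregating cascade.
def matching(query, corpus, k=50):
    words = query.lower().split()
    return [d for d in corpus if any(w in d.lower() for w in words)][:k]

def pre_ranking(query, docs, k=10):
    return sorted(docs, key=len)[:k]                      # cheap proxy score

def ranking(query, docs, k=4):
    words = query.lower().split()
    return sorted(docs, key=lambda d: -sum(w in d.lower() for w in words))[:k]

def reasoning(query, docs):
    return [f"Evidence: {d}" for d in docs]               # placeholder for per-document CoT

def aggregating(query, notes):
    return f"Answer to '{query}' based on:\n" + "\n".join(notes)

corpus = [
    "Pistis symbolized good faith and trust in Greek mythology.",
    "Retrieval-augmented generation grounds LLM outputs in external documents.",
    "Few-shot prompt order can change LLM generation quality.",
]
query = "why does prompt order matter for retrieval-augmented generation"
docs = ranking(query, pre_ranking(query, matching(query, corpus)))
print(aggregating(query, reasoning(query, docs)))
```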
FedClust: Tackling Data Heterogeneity in Federated Learning through Weight-Driven Client Clustering
Islam, Md Sirajul, Javaherian, Simin, Xu, Fei, Yuan, Xu, Chen, Li, Tzeng, Nian-Feng
Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative training of machine learning models over decentralized devices without exposing their local data. One of the major challenges in FL is the presence of uneven data distributions across client devices, violating the well-known assumption of independent-and-identically-distributed (IID) training samples in conventional machine learning. To address the performance degradation caused by such data heterogeneity, clustered federated learning (CFL) shows promise by grouping clients into separate learning clusters based on the similarity of their local data distributions. However, state-of-the-art CFL approaches require a large number of communication rounds to learn the distribution similarities during training, until the formation of clusters stabilizes. Moreover, some of these algorithms rely heavily on a predefined number of clusters, limiting their flexibility and adaptability. In this paper, we propose FedClust, a novel CFL approach that leverages the correlation between local model weights and the data distributions of clients. FedClust groups clients into clusters in a one-shot manner by measuring the similarity among clients based on strategically selected partial weights of their locally trained models. We conduct extensive experiments on four benchmark datasets with different non-IID data settings. Experimental results demonstrate that FedClust improves model accuracy by up to ~45% and converges faster with a communication cost reduced by up to 2.7x compared to its state-of-the-art counterparts.
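The one-shot, weight-driven grouping can be illustrated with the sketch below: each client uploads a strategically chosen subset of its locally trained weights (here just a final-layer vector), and the server forms clusters from pairwise distances in a single pass. The layer choice and threshold-based grouping are simplifying assumptions rather than the paper's exact algorithm.

```python
# Hypothetical sketch of one-shot client clustering from partial model weights.
import numpy as np

def cluster_clients(partial_weights, threshold=1.0):
    """partial_weights: one 1-D array per client -> list of clusters of client indices."""
    clusters = []
    for i, w in enumerate(partial_weights):
        for c in clusters:
            if np.linalg.norm(w - partial_weights[c[0]]) < threshold:
                c.append(i)    # close to this cluster's representative client
                break
        else:
            clusters.append([i])
    return clusters

# Toy example: clients 0 and 1 hold similar data distributions, client 2 does not.
weights = [np.array([0.10, 0.90]), np.array([0.15, 0.88]), np.array([2.0, -1.5])]
print(cluster_clients(weights))  # [[0, 1], [2]]
```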