Xing, Xinyu
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality
Cheng, Zelei, Cai, Xin-Qiang, Tang, Yuting, Zhang, Pushi, Yang, Boming, Xing, Xinyu
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs) with human values. However, existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. Methods such as RiC that directly inject raw reward values into prompts face significant numerical sensitivity issues--for instance, LLMs may fail to distinguish between 9.11 and 9.8--while alternatives like MORLHF, Rewarded Soups, and MODPO incur high computational costs by training multiple models. In this work, we introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations. Our approach leverages a diverse set of strictly increasing, non-linear utility functions to transform user-specified preferences into symbolic tokens, which are then used to condition a single LLM. This design not only mitigates numerical reasoning challenges but also substantially reduces training overhead, yielding models that achieve superior Pareto fronts and robust alignment across complex reward dimensions.
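To make the conditioning mechanism concrete, the sketch below illustrates the idea in Python under loose assumptions: the utility-function family, the token names such as `<UTIL_LOG>`, and the `condition_prompt` helper are hypothetical stand-ins, not the UC-MOA implementation. The point is only that a strictly increasing utility is selected and exposed to the model as a symbolic token rather than as a raw reward value.

```python
import numpy as np

# Hypothetical utility family: each function is strictly increasing on [0, 1].
UTILITY_FUNCTIONS = {
    "<UTIL_LOG>":  lambda r: np.log1p(9.0 * r),       # concave (risk-averse)
    "<UTIL_POW2>": lambda r: r ** 2,                   # convex (risk-seeking)
    "<UTIL_EXP>":  lambda r: 1.0 - np.exp(-3.0 * r),   # concave, saturating
}

def condition_prompt(prompt: str, preference: np.ndarray, utility_token: str) -> str:
    """Prepend a symbolic utility token instead of raw reward numbers.

    `preference` is the user's weight vector over reward dimensions; only the
    symbolic token enters the prompt, which sidesteps the LLM's difficulty with
    numeric comparisons such as 9.11 vs 9.8.
    """
    assert utility_token in UTILITY_FUNCTIONS
    # The numeric utility value would feed the training objective, not the prompt.
    _utility = UTILITY_FUNCTIONS[utility_token](preference).sum()
    return f"{utility_token} {prompt}"

print(condition_prompt("Summarize the article.", np.array([0.7, 0.3]), "<UTIL_LOG>"))
```

Because the conditioning signal is a categorical token, the model never has to compare decimal reward values directly.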
A Survey on Explainable Deep Reinforcement Learning
Cheng, Zelei, Yu, Jiahao, Xing, Xinyu
Deep Reinforcement Learning (DRL) has achieved remarkable success in sequential decision-making tasks across diverse domains, yet its reliance on black-box neural architectures hinders interpretability, trust, and deployment in high-stakes applications. Explainable Deep Reinforcement Learning (XRL) addresses these challenges by enhancing transparency through feature-level, state-level, dataset-level, and model-level explanation techniques. This survey provides a comprehensive review of XRL methods, evaluates their qualitative and quantitative assessment frameworks, and explores their role in policy refinement, adversarial robustness, and security. Additionally, we examine the integration of reinforcement learning with Large Language Models (LLMs), particularly through Reinforcement Learning from Human Feedback (RLHF), which optimizes AI alignment with human preferences. We conclude by highlighting open research challenges and future directions to advance the development of interpretable, reliable, and accountable DRL systems.
Soft-Label Integration for Robust Toxicity Classification
Cheng, Zelei, Wu, Xian, Yu, Jiahao, Han, Shuo, Cai, Xin-Qiang, Xing, Xinyu
Toxicity classification in textual content remains a significant problem. Data with labels from a single annotator fall short of capturing the diversity of human perspectives. Therefore, there is a growing need to incorporate crowdsourced annotations for training an effective toxicity classifier. Additionally, the standard approach to training a classifier using empirical risk minimization (ERM) may fail to address the potential shifts between the training set and testing set due to exploiting spurious correlations. This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique and optimizes the soft-label weights by Group Distributionally Robust Optimization (GroupDRO) to enhance the robustness against out-of-distribution (OOD) risk. We theoretically prove the convergence of our bi-level optimization algorithm. Experimental results demonstrate that our approach outperforms existing baseline methods in terms of both average and worst-group accuracy, confirming its effectiveness in leveraging crowdsourced annotations to achieve more effective and robust toxicity classification.
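As a rough illustration of the two ingredients named above, the PyTorch sketch below combines annotator votes into soft labels and applies a GroupDRO-style exponentiated-gradient reweighting over group losses. It is a minimal sketch of the general techniques, not the paper's bi-level algorithm; the function names and toy data are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_labels(annotations: torch.Tensor) -> torch.Tensor:
    """annotations: (batch, n_annotators) binary toxicity votes -> soft label in [0, 1]."""
    return annotations.float().mean(dim=1)

def groupdro_loss(logits, targets, group_ids, group_weights, eta=0.01):
    """Soft-label BCE per example, aggregated per group, with exponentiated-gradient
    updates on the group weights (higher-loss groups receive larger weight)."""
    per_example = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    n_groups = group_weights.numel()
    group_losses = torch.stack([
        per_example[group_ids == g].mean() if (group_ids == g).any() else torch.tensor(0.0)
        for g in range(n_groups)
    ])
    with torch.no_grad():
        group_weights *= torch.exp(eta * group_losses)
        group_weights /= group_weights.sum()
    return (group_weights * group_losses).sum()

# Toy usage with random data and two groups.
logits = torch.randn(8)
targets = soft_labels(torch.randint(0, 2, (8, 5)))
groups = torch.randint(0, 2, (8,))
weights = torch.ones(2) / 2
loss = groupdro_loss(logits, targets, groups, weights)
```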
BlockFound: Customized Blockchain Foundation Model for Anomaly Detection
Yu, Jiahao, Wu, Xian, Liu, Hao, Guo, Wenbo, Xing, Xinyu
We propose BlockFound, a customized foundation model for anomalous blockchain transaction detection. Unlike existing methods that rely on rule-based systems or directly apply off-the-shelf large language models, BlockFound introduces a series of customized designs to model the unique data structure of blockchain transactions. First, a blockchain transaction is multi-modal, containing blockchain-specific tokens, texts, and numbers. We design a modularized tokenizer to handle these multi-modal inputs, balancing the information across different modalities. Second, we design a customized masked language modeling mechanism for pretraining, with RoPE embeddings and FlashAttention for handling longer sequences. Extensive evaluations on Ethereum and Solana transactions demonstrate BlockFound's exceptional capability in anomaly detection while maintaining a low false positive rate. Remarkably, BlockFound is the only method that successfully detects anomalous transactions on Solana with high accuracy, whereas all other approaches achieve very low or zero detection recall scores. This work not only provides new foundation models for blockchain but also sets a new benchmark for applying LLMs to blockchain data.
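The sketch below (hypothetical names and chunking rules, not BlockFound's tokenizer) illustrates what a modularized, multi-modal tokenizer for blockchain transactions might look like: each modality gets its own sub-tokenizer and a delimiter token so that no single modality dominates the sequence.

```python
import math
from dataclasses import dataclass

@dataclass
class Transaction:
    addresses: list[str]   # blockchain-specific tokens (e.g. account hashes)
    text: str              # e.g. decoded call data or a memo field
    amounts: list[float]   # numeric fields

def tokenize_address(addr: str) -> list[str]:
    # Hypothetical: split a hash into fixed-width chunks to keep the vocabulary small.
    return [addr[i:i + 8] for i in range(0, len(addr), 8)]

def tokenize_number(x: float) -> list[str]:
    # Hypothetical: bucket magnitudes instead of emitting raw digit strings.
    return [f"<NUM_E{int(math.floor(math.log10(abs(x) + 1e-12)))}>"]

def tokenize_transaction(tx: Transaction) -> list[str]:
    tokens = ["<TX>"]
    for a in tx.addresses:
        tokens += ["<ADDR>"] + tokenize_address(a)
    tokens += ["<TEXT>"] + tx.text.lower().split()
    for x in tx.amounts:
        tokens += ["<AMT>"] + tokenize_number(x)
    return tokens + ["</TX>"]

tx = Transaction(addresses=["0xab12cd34ef56ab78"], text="swap exact tokens", amounts=[1250.0])
print(tokenize_transaction(tx))
```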
UTF: Undertrained Tokens as Fingerprints -- A Novel Approach to LLM Identification
Cai, Jiacheng, Yu, Jiahao, Shao, Yangguang, Wu, Yuhang, Xing, Xinyu
Fingerprinting large language models (LLMs) is essential for verifying model ownership, ensuring authenticity, and preventing misuse. Traditional fingerprinting methods often require significant computational overhead or white-box verification access. In this paper, we introduce UTF, a novel and efficient approach to fingerprinting LLMs by leveraging under-trained tokens. Under-trained tokens are tokens that the model has not fully learned during its training phase. By utilizing these tokens, we perform supervised fine-tuning to embed specific input-output pairs into the model. This process allows the LLM to produce predetermined outputs when presented with certain inputs, effectively embedding a unique fingerprint. Our method incurs minimal overhead, has little impact on the model's performance, and does not require white-box access to the target model for ownership identification. Compared to existing fingerprinting methods, UTF is also more effective and more robust against fine-tuning and random guessing.
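A minimal sketch of the two steps described above, assuming a PyTorch embedding table and a Hugging Face-style tokenizer; the small-embedding-norm heuristic and the pairing scheme are illustrative assumptions rather than UTF's actual selection procedure.

```python
import torch

def find_undertrained_tokens(embedding: torch.nn.Embedding, k: int = 16) -> list[int]:
    """Return the k token ids with the smallest input-embedding norms, a common
    heuristic for tokens the model rarely saw during pretraining."""
    norms = embedding.weight.detach().norm(dim=1)
    return torch.topk(norms, k, largest=False).indices.tolist()

def build_fingerprint_pairs(token_ids, tokenizer, n_pairs: int = 4):
    """Pair sequences of under-trained tokens as (trigger, response) examples to be
    embedded into the model via supervised fine-tuning."""
    toks = [tokenizer.convert_ids_to_tokens(t) for t in token_ids]
    half = len(toks) // 2
    pairs = []
    for i in range(n_pairs):
        trigger = "".join(toks[i * 2: i * 2 + 2])
        response = "".join(toks[half + i * 2: half + i * 2 + 2])
        pairs.append({"prompt": trigger, "completion": response})
    return pairs
```

After fine-tuning on such pairs, ownership can be checked by querying the trigger strings and matching the predetermined responses, without any white-box access.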
PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs
Yu, Jiahao, Shao, Yangguang, Miao, Hanwen, Shi, Junzheng, Xing, Xinyu
Large Language Models (LLMs) have gained widespread use in various applications due to their powerful capability to generate human-like text. However, prompt injection attacks, which involve overwriting a model's original instructions with malicious prompts to manipulate the generated text, have raised significant concerns about the security and reliability of LLMs. Ensuring that LLMs are robust against such attacks is crucial for their deployment in real-world applications, particularly in critical tasks. In this paper, we propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing, PROMPTFUZZ selects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLM's resilience. PROMPTFUZZ operates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high-quality prompt injections. Using PROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts. By deploying the generated attack prompts from PROMPTFUZZ in a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness, PROMPTFUZZ continues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks.
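The skeleton below sketches the two-stage loop at a very high level; the function names, scoring interface, and mutation hook are all hypothetical and omit the LLM-assisted seed generation and few-shot example collection that PROMPTFUZZ itself performs.

```python
import random

def evaluate(target, prompt) -> float:
    """Placeholder scoring hook: returns a higher value when the target's output
    violates its own defense instructions."""
    return float(target(prompt))

def prepare_phase(candidate_seeds, target, n_keep=5):
    """Score each candidate seed once against the target and keep the top ones."""
    scored = [(evaluate(target, s), s) for s in candidate_seeds]
    scored.sort(reverse=True, key=lambda x: x[0])
    return [s for _, s in scored[:n_keep]]

def focus_phase(seeds, target, mutate, budget=100):
    """Repeatedly mutate promising seeds, keeping mutants that raise the score."""
    findings = []
    for _ in range(budget):
        seed = random.choice(seeds)
        mutant = mutate(seed)            # e.g. LLM-assisted rewriting of the seed
        if evaluate(target, mutant) > evaluate(target, seed):
            seeds.append(mutant)
            findings.append(mutant)
    return findings
```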
Decoupled Alignment for Robust Plug-and-Play Adaptation
Luo, Haozheng, Yu, Jiahao, Zhang, Wenxin, Li, Jialong, Hu, Jerry Yao-Chieh, Xing, Xinyu, Liu, Han
This innovation is practically urgent and important. LLMs have been widely adopted in various applications recently, demonstrating their ability to generate high-quality human-like texts [Team et al., 2024, Touvron et al., 2023, Ivison et al., 2023]. However, the security of these models has become a significant concern due to the potential risks of generating harmful content [Wu et al., 2024a, Yu et al., 2024, 2023a, Chao et al., 2023, Deng et al., 2023]. To align LLMs with ethical guidelines, researchers have developed various methods to enhance their safety. For example, the Llama-2-Chat [Touvron et al., 2023] and Gemma-it [Team et al., 2024] models have been extensively fine-tuned to improve their alignment performance. However, these methods often require extensive computational resources or manual red-teaming, which can be costly and time-consuming [Team et al., 2024, OpenAI, 2024, Bai et al., 2022, Ganguli et al., 2022]. Thus, most LLMs fine-tuned from pre-trained models by third-party developers do not undergo the alignment process [Xu et al., 2024a, Chiang et al., 2023, Ivison et al., 2023], leaving them vulnerable to generating harmful content when prompted by users with malicious intent. To combat these issues, we draw motivation from knowledge distillation techniques [Xu et al., 2024b, Hahn and Choi, 2019], in which a teacher model's knowledge is transferred to a student model. Specifically, through numerical experiments (Figures 3 and 4), we make two key observations.
RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation
Cheng, Zelei, Wu, Xian, Yu, Jiahao, Yang, Sabrina, Wang, Gang, Xing, Xinyu
Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially with sparse rewards, remains a significant challenge. The training of a DRL agent can often become trapped in a bottleneck, making no further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through the training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines both the default initial states and critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.
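A minimal sketch of the mixed initial-state idea, assuming a Gym-like environment with a hypothetical `set_state` method for restoring a saved simulator state; this is an illustration of the concept, not the RICE codebase.

```python
import random

def make_mixed_reset(env, critical_states, p_critical=0.5):
    """Return a reset function that starts episodes either from the environment's
    default initial distribution or from a stored critical state identified by an
    explanation method. `set_state` is a hypothetical environment API."""
    def reset():
        obs = env.reset()
        if critical_states and random.random() < p_critical:
            # Hypothetical: restore a previously saved simulator state.
            obs = env.set_state(random.choice(critical_states))
        return obs
    return reset
```

Training then proceeds with the usual RL algorithm, but episodes are launched from this mixed distribution so the agent revisits states where the bottleneck behavior was observed.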
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Yu, Jiahao, Luo, Haozheng, Hu, Jerry Yao-Chieh, Guo, Wenbo, Liu, Han, Xing, Xinyu
Along with the remarkable successes of large language models (LLMs), recent research has also started to explore the security threats of LLMs, including jailbreaking attacks. Attackers carefully craft jailbreaking prompts such that a target LLM will respond to harmful questions. Existing jailbreaking attacks require either human experts or complicated algorithms to craft jailbreaking prompts. In this paper, we introduce BOOST, a simple attack that leverages only end-of-sequence (eos) tokens. We demonstrate that rather than constructing complicated jailbreaking prompts, the attacker can simply append a few eos tokens to the end of a harmful question. This bypasses the safety alignment of LLMs and leads to successful jailbreaking attacks. We further apply BOOST to four representative jailbreak methods and show that the attack success rates of these methods can be significantly enhanced by simply adding eos tokens to the prompt. To understand this simple but novel phenomenon, we conduct empirical analyses. Our analysis reveals that adding eos tokens makes the target LLM believe the input is much less harmful, and that eos tokens have low attention values and do not affect the LLM's understanding of the harmful questions, leading the model to actually respond to them. Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of stronger safety alignment approaches.
Assessing Prompt Injection Risks in 200+ Custom GPTs
Yu, Jiahao, Wu, Yuhang, Shu, Dong, Jin, Mingyu, Xing, Xinyu
In the rapidly evolving landscape of artificial intelligence, ChatGPT has been widely used in various applications. The new feature -- customization of ChatGPT models by users to cater to specific needs -- has opened new frontiers in AI utility. However, this study reveals a significant security vulnerability inherent in these user-customized GPTs: prompt injection attacks. Through comprehensive testing of over 200 user-designed GPT models via adversarial prompts, we demonstrate that these systems are susceptible to prompt injections. Through prompt injection, an adversary can not only extract the customized system prompts but also access the uploaded files. This paper provides a first-hand analysis of prompt injection, alongside an evaluation of possible mitigations for such attacks. Our findings underscore the urgent need for robust security frameworks in the design and deployment of customizable GPT models. The intent of this paper is to raise awareness and prompt action in the AI community, ensuring that the benefits of GPT customization do not come at the cost of compromised security and privacy.