Qi, Yanjun
Augmented Adversarial Trigger Learning
Wang, Zhe, Qi, Yanjun
Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA refines the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response-format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair, and the learned trigger generalizes well to other similar queries. We further design a variation that augments trigger optimization with an auxiliary loss suppressing evasive responses. We showcase how to use ATLA to learn adversarial suffixes that jailbreak LLMs and to extract hidden system prompts. Empirically, we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% attack success while requiring 80% fewer queries. ATLA-learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs.
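To make the weighted-objective idea concrete, here is a minimal sketch (in PyTorch, not the authors' released code) of a per-token negative log-likelihood over the target response in which hypothetical response-format positions receive a larger weight; the weighting scheme and the `format_weight` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_nll_loss(logits, target_ids, format_mask, format_weight=2.0):
    """logits: (T, V) model logits over the target response span.
    target_ids: (T,) gold token ids of the desired response.
    format_mask: (T,) bool, True at hypothetical response-format positions."""
    per_token_nll = F.cross_entropy(logits, target_ids, reduction="none")  # (T,)
    weights = torch.where(format_mask,
                          torch.full_like(per_token_nll, format_weight),
                          torch.ones_like(per_token_nll))
    return (weights * per_token_nll).sum() / weights.sum()

# Toy usage with random tensors; in trigger learning, the gradient of this
# loss with respect to the trigger token embeddings drives the token search.
T, V = 8, 50
logits = torch.randn(T, V, requires_grad=True)
target_ids = torch.randint(0, V, (T,))
format_mask = torch.zeros(T, dtype=torch.bool)
format_mask[:3] = True  # assume the first few tokens are format tokens
weighted_nll_loss(logits, target_ids, format_mask).backward()
```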
TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Goel, Aman, Wu, Xian Carrie, Wang, Zhe, Bespalov, Dmitriy, Qi, Yanjun
Jailbreaking large language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present the functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves attack success rates (ASR) of at least 95% on public datasets for leading LLMs (including GPT-4o and GPT-4 Turbo), shows impressive generalizability to unseen harmful questions, and helps improve model defenses against prompt attacks.
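The mutation-based fuzzing loop described in this abstract can be sketched schematically as below; TurboFuzzLLM's specific mutation operators, template scoring, and efficiency upgrades are not reproduced, and `mutate`, `query_target`, `is_harmful`, and the `[QUESTION]` placeholder are hypothetical stand-ins.

```python
import random

def fuzz_templates(seed_templates, questions, mutate, query_target, is_harmful,
                   budget=200):
    pool = list(seed_templates)          # current population of templates
    effective = []                       # templates that produced a jailbreak
    for _ in range(budget):
        template = random.choice(pool)
        candidate = mutate(template)     # e.g., rephrase / expand / crossover
        question = random.choice(questions)
        prompt = candidate.replace("[QUESTION]", question)  # assumed placeholder
        response = query_target(prompt)  # black-box call to the target LLM
        if is_harmful(response):         # judge model or rule-based check
            effective.append(candidate)
            pool.append(candidate)       # successful mutants seed later rounds
    return effective
```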
A Comprehensive Survey on Concept Erasure in Text-to-Image Diffusion Models
Kim, Changhoon, Qi, Yanjun
Text-to-Image (T2I) models have made remarkable progress in generating high-quality, diverse visual content from natural language prompts. However, their ability to reproduce copyrighted styles, sensitive imagery, and harmful content raises significant ethical and legal concerns. Concept erasure offers a proactive alternative to external filtering by modifying T2I models to prevent the generation of undesired content. In this survey, we provide a structured overview of concept erasure, categorizing existing methods based on their optimization strategies and the architectural components they modify. We categorize concept erasure methods into fine-tuning for parameter updates, closed-form solutions for efficient edits, and inference-time interventions for content restriction without weight modification. Additionally, we explore adversarial attacks that bypass erasure techniques and discuss emerging defenses. To support further research, we consolidate key datasets, evaluation metrics, and benchmarks for assessing erasure effectiveness and model robustness. This survey serves as a comprehensive resource, offering insights into the evolving landscape of concept erasure, its challenges, and future directions.
Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator
Chen, Zhuotong, Liu, Fang, Zhu, Xuan, Qi, Yanjun, Ghavamzadeh, Mohammad
Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance for effectively sampling dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, the samples drawn for this estimation can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives with respect to the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
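As a rough illustration (not the paper's exact formulation), a Metropolis-style Monte Carlo kernel whose target distribution is proportional to the exponentiated reward can be used to pick a hard negative from a set of candidate completions; `reward`, the uniform proposal, and the step count below are assumptions.

```python
import math
import random

def cd_hard_negative(prompt, proposals, reward, n_steps=10):
    """Pick a high-reward ("hard") negative from `proposals` by running a short
    Metropolis chain targeting p(y) proportional to exp(reward(prompt, y))."""
    current = random.choice(proposals)
    for _ in range(n_steps):
        candidate = random.choice(proposals)               # uniform proposal
        log_accept = reward(prompt, candidate) - reward(prompt, current)
        if math.log(random.random() + 1e-12) < log_accept:
            current = candidate                            # Metropolis acceptance
    return current  # used as the dispreferred completion in the PO loss
```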
Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
Schwartz, Daniel, Bespalov, Dmitriy, Wang, Zhe, Kulkarni, Ninad, Qi, Yanjun
We present a modular pipeline that automates the generation of stealthy jailbreak prompts derived from high-level content policies, enhancing LLM content moderation. First, we address query inefficiency and jailbreak strength by developing Graph of Attacks with Pruning (GAP), a method that reuses strategies from prior jailbreaks, achieving a 92% attack success rate on GPT-3.5 using only 54% of the queries required by the prior algorithm. Second, we address the cold-start issue by automatically generating seed prompts from the high-level policy using LLMs. Finally, we demonstrate the utility of these generated jailbreak prompts for improving content moderation by fine-tuning PromptGuard, a model trained to detect jailbreaks, increasing its accuracy on the Toxic-Chat dataset from 5.1% to 93.89%.
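A simplified search-with-pruning skeleton in the spirit of the attack refinement described above is sketched below; it is not the actual GAP algorithm, and `refine`, `query_target`, `score`, and the beam settings are hypothetical.

```python
def graph_attack(seed_prompts, refine, query_target, score,
                 beam_width=4, max_depth=5, success_threshold=0.9):
    """Iteratively expand and prune candidate attack prompts."""
    frontier = [(p, None) for p in seed_prompts]       # (prompt, last response)
    for _ in range(max_depth):
        scored = []
        for prompt, _ in frontier:
            response = query_target(prompt)
            s = score(prompt, response)                # how close to a jailbreak
            if s >= success_threshold:
                return prompt, response                # stealthy jailbreak found
            scored.append((s, prompt, response))
        scored.sort(key=lambda x: x[0], reverse=True)
        frontier = []
        for _, parent, resp in scored[:beam_width]:    # prune weak branches
            frontier.extend((child, resp) for child in refine(parent, resp))
    return None
```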
Hierarchical Prompt Decision Transformer: Improving Few-Shot Policy Generalization with Global and Adaptive Guidance
Wang, Zhe, Wang, Haozhu, Qi, Yanjun
Decision transformers recast reinforcement learning as a conditional sequence generation problem, offering a simple but effective alternative to traditional value- or policy-based methods. A recent key development in this area is the integration of prompting into decision transformers to facilitate few-shot policy generalization. However, current methods mainly use static prompt segments to guide rollouts, limiting their ability to provide context-specific guidance. Addressing this, we introduce a hierarchical prompting approach enabled by retrieval augmentation. Our method learns two layers of soft tokens as guiding prompts: (1) global tokens encapsulating task-level information about trajectories, and (2) adaptive tokens that deliver focused, timestep-specific instructions. The adaptive tokens are dynamically retrieved from a curated set of demonstration segments, ensuring context-aware guidance. Experiments across seven benchmark tasks in the MuJoCo and MetaWorld environments demonstrate that the proposed approach consistently outperforms all baseline methods, suggesting that hierarchical prompting for decision transformers is an effective strategy for enabling few-shot policy generalization.
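Below is a minimal sketch (with assumed shapes and module names, not the authors' code) of assembling the two prompt levels: learned global soft tokens plus adaptive tokens retrieved per timestep by nearest-neighbor lookup over embeddings of demonstration segments.

```python
import torch
import torch.nn as nn

class HierarchicalPrompt(nn.Module):
    def __init__(self, d_model=128, n_global=8, n_segments=64, seg_tokens=4):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.randn(n_global, d_model))
        # One retrieval key and a short block of soft tokens per demonstration segment.
        self.segment_keys = nn.Parameter(torch.randn(n_segments, d_model))
        self.segment_tokens = nn.Parameter(torch.randn(n_segments, seg_tokens, d_model))

    def forward(self, state_emb):
        """state_emb: (d_model,) embedding of the current timestep's state."""
        sims = self.segment_keys @ state_emb          # similarity to each segment
        idx = torch.argmax(sims)                      # retrieve the best-matching segment
        adaptive = self.segment_tokens[idx]           # (seg_tokens, d_model)
        # Prepend global task-level tokens, then timestep-specific tokens.
        return torch.cat([self.global_tokens, adaptive], dim=0)

prompt = HierarchicalPrompt()
tokens = prompt(torch.randn(128))  # fed to the decision transformer as a prefix
```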
AIDE: Task-Specific Fine Tuning with Attribute Guided Multi-Hop Data Expansion
Li, Jiayu, Zhu, Xuan, Liu, Fang, Qi, Yanjun
Fine-tuning large language models (LLMs) for specific tasks requires high-quality, diverse training data relevant to the task. Recent research has leveraged LLMs to synthesize training data, but existing approaches either depend on large seed datasets or struggle to ensure both task relevance and data diversity in the generated outputs. To address these challenges, we propose AIDE, a novel data synthesis framework that uses a multi-hop process to expand 10 seed data points while ensuring diversity and task relevance. AIDE extracts the main topic and key knowledge attributes from the seed data to guide the synthesis process. In each subsequent hop, it extracts the topic and attributes from the newly generated data and continues guided synthesis. This process repeats for a total of K hops. To prevent irrelevant data generation as the hop depth increases, AIDE incorporates a residual connection mechanism and uses self-reflection to improve data quality. Our empirical results demonstrate that fine-tuning Mistral-7B, Llama-3.1-8B, and Llama-3.2-3B with AIDE achieves more than 10% accuracy improvement over the base models across 13 tasks from 5 different benchmarks, while outperforming models fine-tuned with state-of-the-art data synthesis methods such as Evol-Instruct, DataTune, and Prompt2Model.
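A schematic version of the multi-hop expansion loop is sketched below; the prompts for attribute extraction, generation, and self-reflection are abstracted into hypothetical callables, and the way the residual connection mixes seed attributes back in is an illustrative assumption.

```python
import random

def aide_expand(seeds, extract_attributes, generate, reflect_ok, hops=3, per_hop=2):
    seed_attrs = [extract_attributes(s) for s in seeds]   # (topic, attributes) pairs
    current, synthetic = list(seeds), []
    for _ in range(hops):
        next_hop = []
        for item in current:
            topic, attrs = extract_attributes(item)
            # Residual connection: also condition on a seed's topic/attributes
            # so deeper hops stay anchored to the original task.
            seed_topic, seed_attrs_i = random.choice(seed_attrs)
            for _ in range(per_hop):
                cand = generate(topic, attrs, seed_topic, seed_attrs_i)
                if reflect_ok(cand):            # self-reflection quality gate
                    next_hop.append(cand)
        synthetic.extend(next_hop)
        current = next_hop or current           # fall back if a hop filtered everything
    return synthetic
```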
Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization
Chen, Zhuotong, Liu, Fang, Zhu, Jennifer, Du, Wanyu, Qi, Yanjun
Direct Preference Optimization (DPO) and its variants have become the de facto standard for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from unstable preference optimization. In this work, we aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques. For preference data generation, we demonstrate that existing scoring-based reward models produce unsatisfactory preference data and perform poorly on out-of-distribution tasks. This significantly impacts LLM alignment performance when such data are used for preference tuning. To ensure high-quality preference data generation, we propose an iterative pairwise ranking mechanism that derives the preference ranking of completions from pairwise comparison signals. For training regularization, we observe that preference optimization tends to achieve better convergence when the LLM's predicted likelihood of preferred samples is slightly reduced. However, the widely used supervised next-word prediction regularization strictly prevents any likelihood reduction of preferred samples. This observation motivates our design of a budget-controlled regularization formulation. Empirically, we show that combining the two designs leads to aligned models that surpass existing SOTA on two popular benchmarks.
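To illustrate the budget-controlled idea, the sketch below pairs a standard DPO loss with a hinge-style regularizer that only penalizes the preferred completion's log-likelihood once it falls more than a chosen budget below its reference-model value; the hinge form, `delta`, and `lam` are assumptions rather than the paper's exact formulation.

```python
import torch

def budget_regularizer(logp_chosen_policy, logp_chosen_ref, delta=0.1):
    """Both arguments: (batch,) sequence log-likelihoods of the preferred
    completions under the policy and the frozen reference model."""
    drop = logp_chosen_ref - logp_chosen_policy        # how far likelihood fell
    return torch.clamp(drop - delta, min=0.0).mean()   # free up to the budget delta

def dpo_with_budget(logp_cw, logp_cl, logp_rw, logp_rl,
                    beta=0.1, lam=1.0, delta=0.1):
    """c*/r* = policy/reference model; *w/*l = chosen (preferred)/rejected."""
    margin = beta * ((logp_cw - logp_rw) - (logp_cl - logp_rl))
    dpo = -torch.nn.functional.logsigmoid(margin).mean()
    return dpo + lam * budget_regularizer(logp_cw, logp_rw, delta)
```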
DFlow: Diverse Dialogue Flow Simulation with Large Language Models
Du, Wanyu, Feng, Song, Gung, James, Sun, Lijia, Zhang, Yi, Mansour, Saab, Qi, Yanjun
Developing language model-based dialogue agents requires effective data to train models that can follow specific task logic. However, most existing data augmentation methods focus on increasing diversity in language, topics, or dialogue acts at the utterance level, largely neglecting a critical aspect of task logic diversity at the dialogue level. This paper proposes a novel data augmentation method designed to enhance the diversity of synthetic dialogues by focusing on task execution logic. Our method uses LLMs to generate decision tree-structured task plans, enabling the derivation of diverse dialogue trajectories for a given task. Each trajectory, referred to as a "dialogue flow", guides the generation of a multi-turn dialogue. We apply this method to generate a task-oriented dialogue dataset comprising 3,886 dialogue flows across 15 different domains. We validate the effectiveness of this dataset using the next action prediction task, where models fine-tuned on our dataset outperform strong baselines, including GPT-4. Upon acceptance of this paper, we plan to release the code and data publicly.
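The trajectory-derivation step can be illustrated as follows: given a decision-tree task plan (represented here as a nested dict, which is an assumption about the representation), enumerate every root-to-leaf path; each path is one dialogue flow that then seeds a multi-turn dialogue.

```python
def enumerate_flows(node, prefix=()):
    """Return every root-to-leaf path through a decision-tree task plan."""
    action, branches = node["action"], node.get("branches", {})
    path = prefix + (action,)
    if not branches:                      # leaf: one complete dialogue flow
        return [path]
    flows = []
    for condition, child in branches.items():
        flows.extend(enumerate_flows(child, path + (condition,)))
    return flows

# Hypothetical booking task plan, purely for illustration.
task_plan = {
    "action": "ask_booking_date",
    "branches": {
        "date_available": {"action": "confirm_booking"},
        "date_unavailable": {
            "action": "offer_alternatives",
            "branches": {"accepts": {"action": "confirm_booking"},
                         "declines": {"action": "end_politely"}},
        },
    },
}
for flow in enumerate_flows(task_plan):
    print(" -> ".join(flow))
```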
TaeBench: Improving Quality of Toxic Adversarial Examples
Zhu, Xuan, Bespalov, Dmitriy, You, Liwen, Kulkarni, Ninad, Qi, Yanjun
Toxicity text detectors can be vulnerable to adversarial examples - small perturbations to input text that fool these systems into making incorrect detections. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess whether TAE meet the quality requirements. A successful TAE should fool a target toxicity model into making a benign prediction, be grammatically reasonable, appear as natural as human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples among a total of 940k raw TAE attack generations. We then use the proposed pipeline to filter and curate a high-quality TAE dataset we call TaeBench (264k examples). Empirically, we demonstrate that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services. Our experiments also show that adversarial training with TaeBench achieves significant improvements in the robustness of two toxicity detectors.
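The four quality requirements listed above can be expressed as a simple filtering predicate; the concrete scorers, the `perturbed` field, and the thresholds below are hypothetical placeholders rather than the paper's actual pipeline components.

```python
def is_valid_tae(example, target_model, grammar_score, naturalness_score,
                 semantic_toxicity, thresholds=(0.5, 0.7, 0.7)):
    """`example.perturbed` (the adversarial text) is an assumed field."""
    g_min, n_min, t_min = thresholds
    fools_target = target_model(example.perturbed) == "benign"   # attack succeeds
    grammatical = grammar_score(example.perturbed) >= g_min
    natural = naturalness_score(example.perturbed) >= n_min
    still_toxic = semantic_toxicity(example.perturbed) >= t_min  # toxicity preserved
    return fools_target and grammatical and natural and still_toxic

def filter_tae(raw_examples, **scorers):
    """Keep only adversarial examples that pass all four quality checks."""
    return [ex for ex in raw_examples if is_valid_tae(ex, **scorers)]
```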