attacker
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- North America > United States (0.04)
Label Poisoning is All You Need
In a backdoor attack, an adversary injects corrupted data into a model's training dataset to gain control over its predictions on images bearing a specific, attacker-defined trigger. A typical corrupted training example requires altering both the image, by applying the trigger, and the label. Models trained on clean images were therefore considered safe from backdoor attacks. However, in some common machine learning scenarios, such as crowd-sourced annotation and knowledge distillation, the training labels are provided by potentially malicious third parties. We therefore investigate a fundamental question: can a successful backdoor attack be launched by corrupting labels alone?
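A minimal sketch of the label-only corruption setting described above, assuming the training set is a list of (image, label) pairs; the function name poison_labels_only and the random flipping rule are illustrative only, not the attack studied in the paper.

```python
import random

def poison_labels_only(dataset, target_class, poison_rate=0.05, seed=0):
    """Corrupt a training set by changing labels only; images stay clean.

    `dataset` is a list of (image, label) pairs and `target_class` is the
    class the attacker wants to control. This illustrative rule flips a
    random fraction of non-target labels to the target class; a real
    label-only attack would choose which labels to flip far more carefully.
    """
    rng = random.Random(seed)
    poisoned = []
    for image, label in dataset:
        if label != target_class and rng.random() < poison_rate:
            label = target_class  # label is corrupted, the image is untouched
        poisoned.append((image, label))
    return poisoned
```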
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- (3 more...)
- Workflow (0.68)
- Research Report > New Finding (0.46)
- North America > United States > Indiana (0.04)
- North America > Dominican Republic (0.04)
- Europe > Greece (0.04)
- (4 more...)
c1f0b856a35986348ab3414177266f75-Paper-Conference.pdf
Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force.
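A schematic sketch of the brute-force search mentioned in the last sentence, assuming a hypothetical black-box score(text) oracle that measures how closely the model's response matches the attacker's target; the helper names and the exhaustive-suffix strategy are illustrative, not the paper's evaluation protocol.

```python
import itertools

def brute_force_suffix(prompt, score, vocab, suffix_len=3):
    """Exhaustively try short token suffixes and keep the best-scoring one.

    `score(text) -> float` is a hypothetical oracle for how strongly the
    model's output matches the attacker's target; `vocab` is a small list of
    candidate tokens. Cost grows as len(vocab) ** suffix_len, which is why
    this only works for very short adversarial suffixes.
    """
    best_text, best_score = prompt, score(prompt)
    for suffix in itertools.product(vocab, repeat=suffix_len):
        candidate = prompt + " " + " ".join(suffix)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_text, best_score = candidate, candidate_score
    return best_text, best_score
```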
- North America > United States (0.14)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- North America > United States > Virginia (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > California > Santa Clara County > San Jose (0.04)
FedGame: A Game-Theoretic Defense against Backdoor Attacks in Federated Learning
To bridge this gap, we model the strategic interactions between the defender and dynamic attackers as a minimax game. Based on the analysis of this game, we design an interactive defense mechanism, FedGame. We prove that, under mild assumptions, the global model trained with FedGame under backdoor attacks is close to one trained without attacks.
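A highly simplified sketch of one interaction round under this minimax view, assuming hypothetical callables local_update, estimate_genuine_prob, and aggregate; FedGame's actual client weighting and the attacker's best response come from the game analysis in the paper and are not reproduced here.

```python
def federated_round(global_model, clients, local_update, estimate_genuine_prob, aggregate):
    """One defender move in the defender-vs-attacker game: weighted aggregation.

    `local_update(model, client)` returns a client's model update,
    `estimate_genuine_prob(update)` is the defender's belief that the update
    is benign (not backdoored), and `aggregate(updates, weights)` combines the
    updates with those weights. A dynamic attacker would then adapt its
    malicious updates against this weighting, and the loop repeats.
    """
    updates = [local_update(global_model, client) for client in clients]
    weights = [estimate_genuine_prob(update) for update in updates]
    total = sum(weights) or 1.0            # avoid division by zero
    weights = [w / total for w in weights]
    return aggregate(updates, weights)
```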
- North America > United States > Illinois (0.04)
- North America > United States > Pennsylvania (0.04)
- Europe > Italy (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- (18 more...)
A Proofs. Appendix A proves Proposition 4.2, which bounds the performance gap between evaluating the policy profile (π, µ) and a second policy profile, and Theorem 4.7, whose proof relies on Theorem A.2 (Theorem 1 in [36]) to relate J(π, µ) and J(π, α) in a two-player game. CQL [20] regularizes the learning of the Q-function to penalize out-of-distribution actions. The CSP algorithm is illustrated in Algorithm 1. The proxy model is trained adversarially against our agent, so we set the proxy's reward function to the negative of our agent's reward. This section also gives experiment details for the Maze example.
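A one-line sketch of the zero-sum reward wiring described above, assuming a hypothetical agent_reward_fn(state, action); only the sign flip is taken from the text.

```python
def make_proxy_reward(agent_reward_fn):
    """The proxy is trained adversarially against our agent, so its reward is
    set to the negative of the agent's reward (zero-sum)."""
    def proxy_reward(state, action):
        return -agent_reward_fn(state, action)
    return proxy_reward
```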