Goto

Collaborating Authors

 Squicciarini, Anna


The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents

arXiv.org Artificial Intelligence

Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration. This enhanced ability to interact with external systems and process various data sources, while powerful, introduces significant security vulnerabilities. In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions. While existing defenses based on rule constraints, source spotlighting, and authentication protocols show promise, they struggle to maintain robust security while preserving task functionality. We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. Based on this insight, we develop Task Shield, a test-time defense mechanism that systematically verifies whether each instruction and tool call contributes to user-specified goals. Through experiments on the AgentDojo benchmark, we demonstrate that Task Shield reduces attack success rates (2.07\%) while maintaining high task utility (69.79\%) on GPT-4o.


RoCourseNet: Distributionally Robust Training of a Prediction Aware Recourse Model

arXiv.org Artificial Intelligence

Counterfactual (CF) explanations for machine learning (ML) models are preferred by end-users, as they explain the predictions of ML models by providing a recourse (or contrastive) case to individuals who are adversely impacted by predicted outcomes. Existing CF explanation methods generate recourses under the assumption that the underlying target ML model remains stationary over time. However, due to commonly occurring distributional shifts in training data, ML models constantly get updated in practice, which might render previously generated recourses invalid and diminish end-users trust in our algorithmic framework. To address this problem, we propose RoCourseNet, a training framework that jointly optimizes predictions and recourses that are robust to future data shifts. This work contains four key contributions: (1) We formulate the robust recourse generation problem as a tri-level optimization problem which consists of two sub-problems: (i) a bi-level problem that finds the worst-case adversarial shift in the training data, and (ii) an outer minimization problem to generate robust recourses against this worst-case shift. (2) We leverage adversarial training to solve this tri-level optimization problem by: (i) proposing a novel virtual data shift (VDS) algorithm to find worst-case shifted ML models via explicitly considering the worst-case data shift in the training dataset, and (ii) a block-wise coordinate descent procedure to optimize for prediction and corresponding robust recourses. (3) We evaluate RoCourseNet's performance on three real-world datasets, and show that RoCourseNet consistently achieves more than 96% robust validity and outperforms state-of-the-art baselines by at least 10% in generating robust CF explanations. (4) Finally, we generalize the RoCourseNet framework to accommodate any parametric post-hoc methods for improving robust validity.


Automated Detection of Doxing on Twitter

arXiv.org Artificial Intelligence

The term"dox" is an abbreviation for"documents," and doxing is the act of disclosing private, sensitive, or personally identifiable information about a person without their consent. Sensitive information can be considered as any type of confidential information or any information that can be used to identify a person uniquely. This information is called doxed information and includes demographic information [53] such as birthday, sexual orientation, race, ethnicity, and religion, or location information which can be used to precisely or approximately locate a person such as the street address, ZIP code, IP address, and GPS coordinates. Other categories of doxed information are identity documents like passport number and social security number, contact information like phone number and email address, financial information such as credit card and bank account details, or sign-in credentials such as usernames and passwords[15]. Such disclosure may have various consequences. It may encourage forms of bigotry and hate groups, encourage human or child trafficking and endanger people's lives or reputations, scare and intimidate people by swatting


A Synthetic Prediction Market for Estimating Confidence in Published Work

arXiv.org Artificial Intelligence

Explainably estimating confidence in published scholarly work offers opportunity for faster and more robust scientific progress. We develop a synthetic prediction market to assess the credibility of published claims in the social and behavioral sciences literature. We demonstrate our system and detail our findings using a collection of known replication projects. We suggest that this work lays the foundation for a research agenda that creatively uses AI for peer review.


Uncovering Scene Context for Predicting Privacy of Online Shared Images

AAAI Conferences

With the exponential increase in the number of images that are shared online every day, the development of effective and efficient learning methods for image privacy prediction has become crucial. Prior works have used as features automatically derived object tags from images' content and manually annotated user tags. However, we believe that in addition to objects, the scene context obtained from images’ content can improve the performance of privacy prediction. Hence, we propose to uncover scene-based tags from images' content using convolutional neural networks. Experimental results on a Flickr dataset show that the scene tags and object tags complement each other and yield the best performance when used in combination with user tags.