Malicious actor


Benchmarking Robustness to Adversarial Image Obfuscations

Neural Information Processing Systems

Automated content filtering and moderation is an important tool that allows online platforms to build thriving user communities that facilitate cooperation and prevent abuse. Unfortunately, resourceful actors try to bypass automated filters in a bid to post content that violates platform policies and codes of conduct. To reach this goal, these malicious actors may obfuscate policy-violating images (e.g., overlaying harmful images with carefully selected benign images or visual patterns) to prevent machine learning models from reaching the correct decision. In this paper, we invite researchers to tackle this specific issue and present a new image benchmark. This benchmark, based on ImageNet, simulates the type of obfuscations created by malicious actors. It goes beyond ImageNet-C and ImageNet-C-bar by proposing general, drastic, adversarial modifications that preserve the original content intent. It aims to tackle a more common adversarial threat than the one considered by ℓp-norm-bounded adversaries. We evaluate 33 pretrained models on the benchmark and train models with different augmentations, architectures and training methods on subsets of the obfuscations to measure generalization. Our hope is that this benchmark will encourage researchers to test their models and methods and try to find new approaches that are more robust to these obfuscations.
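
As a rough illustration of the kind of overlay obfuscation this benchmark simulates (not the benchmark's own transformations), the sketch below alpha-blends a benign overlay onto a stand-in image and checks whether a pretrained ImageNet classifier still reaches the same decision. The file paths, blend strength, and choice of ResNet-50 are illustrative assumptions.

# Minimal sketch of an overlay-style obfuscation and a robustness check.
# Requires torch/torchvision/Pillow; image paths are placeholders.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def obfuscate(base: Image.Image, overlay: Image.Image, alpha: float = 0.5) -> Image.Image:
    """Blend a benign overlay onto the base image to mask its content."""
    overlay = overlay.resize(base.size).convert("RGB")
    return Image.blend(base.convert("RGB"), overlay, alpha)

# Off-the-shelf pretrained classifier (torchvision >= 0.13 weight enums).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

base = Image.open("base.jpg")        # stand-in for a policy-violating image
overlay = Image.open("overlay.jpg")  # carefully selected benign pattern

with torch.no_grad():
    clean_pred = model(preprocess(base).unsqueeze(0)).argmax(1).item()
    obf_pred = model(preprocess(obfuscate(base, overlay)).unsqueeze(0)).argmax(1).item()

print("prediction unchanged under obfuscation:", clean_pred == obf_pred)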


Socioeconomic Threats of Deepfakes and the Role of Cyber-Wellness Education in Defense

Communications of the ACM

Due to the limits of science and its steep learning curve, we must rely on the expertise of others to develop our knowledge and skills [26]. Toward this end, social media platforms have revolutionized how netizens, users who are actively engaged in online communities, gain knowledge and skills by facilitating the exchange of costless information with the public (for example, followers or influencers). Businesses around the world also use these platforms along with tools based on generative artificial intelligence (GenAI) to craft synthetic media, hoping to grow revenue by attracting more customers and improving their online experience [28]. Generative AI tools can empower cyber threats and have cyberpsychological effects on netizens, allowing malicious actors to craft deepfakes in the form of disinformation, misinformation, and malinformation. Service providers must not only enhance GenAI tools to reduce hallucinations, but they also have a statutory duty to mitigate data-driven biases.


When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines

Pendse, Sachin R., Gergle, Darren, Kornfield, Rachel, Meyerhoff, Jonah, Mohr, David, Suh, Jina, Wescott, Annie, Williams, Casey, Schleider, Jessica

arXiv.org Artificial Intelligence

Red-teaming is a core part of the infrastructure that ensures that AI models do not produce harmful content. Unlike past technologies, the black-box nature of generative AI systems necessitates a uniquely interactional mode of testing, one in which individuals on red teams actively interact with the system, leveraging natural language to simulate malicious actors and solicit harmful outputs. This interactional labor done by red teams can result in mental health harms that are uniquely tied to the adversarial engagement strategies necessary to effectively red team. The importance of ensuring that generative AI models do not propagate societal or individual harm is widely recognized; one less visible foundation of end-to-end AI safety is also the protection of the mental health and wellbeing of those who work to keep model outputs safe. In this paper, we argue that the unmet mental health needs of AI red-teamers are a critical workplace safety concern. Through analyzing the unique mental health impacts associated with the labor done by red teams, we propose potential individual and organizational strategies that could be used to meet these needs and safeguard the mental health of red-teamers. We develop our proposed strategies by drawing parallels between common red-teaming practices and interactional labor common to other professions (including actors, mental health professionals, conflict photographers, and content moderators), describing how individuals and organizations within these professional spaces safeguard their mental health given similar psychological demands. Drawing on these protective practices, we describe how safeguards could be adapted for the distinct mental health challenges experienced by red-teaming organizations as they mitigate emerging technological risks on the new digital frontlines.


Urgent warning as 1.5 MILLION private photos are leaked from BDSM dating apps - so, have your sexy snaps been exposed?

Daily Mail - Science & tech

Cybersecurity researchers have issued an urgent warning after almost 1.5 million private photos from dating apps were exposed. Affected apps include the kink dating sites BDSM People and CHICA, as well as the LGBT dating services PINK, BRISH, and TRANSLOVE, all of which were developed by M.A.D Mobile. The leaked files include photos used for verification, photos removed by app moderators, and photos sent in direct messages between users, many of which were explicit. These sensitive snaps were being stored online without password protection, meaning anyone with the link could view and download them. Researchers from Cybernews, who discovered the vulnerability, say this easily exploited security flaw put up to 900,000 users at risk of further hacks or extortion.


Beyond Release: Access Considerations for Generative AI Systems

Solaiman, Irene, Bommasani, Rishi, Hendrycks, Dan, Herbert-Voss, Ariel, Jernite, Yacine, Skowron, Aviya, Trask, Andrew

arXiv.org Artificial Intelligence

Generative AI release decisions determine whether system components are made available, but release does not address many other elements that change how users and stakeholders are able to engage with a system. Beyond release, access to system components informs potential risks and benefits. Access refers to the practical infrastructural, technical, and societal requirements for using available components in some way. We deconstruct access along three axes: resourcing, technical usability, and utility. Within each category, a set of variables per system component clarifies tradeoffs. For example, resourcing requires access to computing infrastructure to serve model weights. We also compare the accessibility of four high-performance language models, two open-weight and two closed-weight, showing that similar considerations apply to all four when framed in terms of access variables. Access variables set the foundation for being able to scale or increase access to users; we examine the scale of access and how scale affects the ability to manage and intervene on risks. This framework better encompasses the landscape and risk-benefit tradeoffs of system releases to inform release decisions, research, and policy.
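
As a toy sketch of how the abstract's three access axes might be recorded per system component, the structure below compares an open-weight release with a hosted API. The field names and example values are illustrative assumptions, not the authors' instrument.

# Toy encoding of the access framework's three axes (resourcing, technical
# usability, utility) per system component. Field names and example values
# are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class ComponentAccess:
    component: str            # e.g. "model weights", "hosted API"
    resourcing: str           # infrastructure and money needed to use it
    technical_usability: str  # tooling, documentation, required expertise
    utility: str              # what the component is actually useful for

open_weights = ComponentAccess(
    component="model weights",
    resourcing="GPUs or paid hosting to serve the weights",
    technical_usability="ML engineering skills and a serving stack",
    utility="fine-tuning, inspection, offline deployment",
)

hosted_api = ComponentAccess(
    component="hosted API",
    resourcing="an account and per-token fees; no local compute",
    technical_usability="usable from any HTTP client",
    utility="inference only; no weight inspection or fine-grained control",
)

for c in (open_weights, hosted_api):
    print(c)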


Position: Editing Large Language Models Poses Serious Safety Risks

Youssef, Paul, Zhao, Zhixue, Braun, Daniel, Schlötterer, Jörg, Seifert, Christin

arXiv.org Artificial Intelligence

Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note that the wide availability, low computational cost, high performance, and stealth of KEs make them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.
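
One countermeasure the abstract gestures at, verifying models before downloading or deploying them, can be approximated by checking a checkpoint's digest against a publisher-provided manifest. The sketch below is a minimal illustration under that assumption; the manifest format and file names are hypothetical, and a real ecosystem would additionally need the manifest itself to be signed.

# Minimal sketch of verifying a downloaded checkpoint against a trusted
# manifest of SHA-256 digests. Manifest format and file names are
# hypothetical placeholders.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_checkpoint(checkpoint: Path, manifest: Path) -> bool:
    """True only if the checkpoint's digest matches the manifest entry."""
    expected = json.loads(manifest.read_text())  # e.g. {"model.safetensors": "<hex digest>"}
    return expected.get(checkpoint.name) == sha256_of(checkpoint)

if __name__ == "__main__":
    ok = verify_checkpoint(Path("model.safetensors"), Path("manifest.json"))
    print("checkpoint matches publisher manifest:", ok)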


What AI evaluations for preventing catastrophic risks can and cannot do

Barnett, Peter, Thiergart, Lisa

arXiv.org Artificial Intelligence

AI evaluations are an important component of the AI governance toolkit, underlying current approaches to safety cases for preventing catastrophic risks. Our paper examines what these evaluations can and cannot tell us. Evaluations can establish lower bounds on AI capabilities and assess certain misuse risks given sufficient effort from evaluators. Unfortunately, evaluations face fundamental limitations that cannot be overcome within the current paradigm. These include an inability to establish upper bounds on capabilities, reliably forecast future model capabilities, or robustly assess risks from autonomous AI systems. This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe. We conclude with recommendations for incremental improvements to frontier AI safety, while acknowledging these fundamental limitations remain unsolved.


Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation

Barnett, Peter, Thiergart, Lisa

arXiv.org Artificial Intelligence

As AI systems advance, AI evaluations are becoming an important pillar of regulations for ensuring safety. We argue that such regulation should require developers to explicitly identify and justify key underlying assumptions about evaluations as part of their case for safety. We identify core assumptions in AI evaluations (both for evaluating existing models and forecasting future models), such as comprehensive threat modeling, proxy task validity, and adequate capability elicitation. Many of these assumptions cannot currently be well justified. If regulation is to be based on evaluations, it should require that AI development be halted if evaluations demonstrate unacceptable danger or if these assumptions are inadequately justified. Our presented approach aims to enhance transparency in AI development, offering a practical path towards more effective governance of advanced AI systems.


On the Abuse and Detection of Polyglot Files

Koch, Luke, Oesch, Sean, Chaulagain, Amul, Dixon, Jared, Dixon, Matthew, Huettal, Mike, Sadovnik, Amir, Watson, Cory, Weber, Brian, Hartman, Jacob, Patulski, Richard

arXiv.org Artificial Intelligence

A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild -- the first of its kind -- we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
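
As a rough illustration of the kind of signal a polyglot detector looks for (this is not PolyConv, and naive signature scans are exactly what the abstract says fail in the wild), the sketch below flags a file whose bytes contain magic numbers for more than one format, such as a JPEG that also embeds a ZIP local-file header. The format list is a small, illustrative subset.

# Toy heuristic: flag files containing magic bytes for more than one format.
# Illustrative only; real polyglot detection must handle far more cases.
import sys

SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "pdf": b"%PDF-",
    "zip": b"PK\x03\x04",  # also the container for docx, jar, apk
}

def formats_present(data: bytes) -> set:
    """Every known format whose magic bytes appear anywhere in the file."""
    return {name for name, magic in SIGNATURES.items() if magic in data}

def looks_polyglot(path: str) -> bool:
    with open(path, "rb") as f:
        found = formats_present(f.read())
    print(f"{path}: signatures found -> {sorted(found)}")
    return len(found) > 1

if __name__ == "__main__":
    for p in sys.argv[1:]:
        print("possible polyglot" if looks_polyglot(p) else "looks single-format")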