
 Law


Multi-head Reward Aggregation Guided by Entropy

arXiv.org Artificial Intelligence

Aligning large language models (LLMs) with safety guidelines typically involves reinforcement learning from human feedback (RLHF), relying on human-generated preference annotations. However, assigning consistent overall quality ratings is challenging, prompting recent research to shift towards detailed evaluations based on multiple specific safety criteria. This paper uncovers a consistent observation: safety rules characterized by high rating entropy are generally less reliable in identifying responses preferred by humans. Leveraging this finding, we introduce ENCORE, a straightforward entropy-guided approach that composes multi-head rewards by downweighting rules exhibiting high rating entropy. Theoretically, we demonstrate that rules with elevated entropy naturally receive minimal weighting in the Bradley-Terry optimization framework, justifying our entropy-based penalization. Through extensive experiments on RewardBench safety tasks, our method significantly surpasses several competitive baselines, including random weighting, uniform weighting, single-head Bradley-Terry models, and LLM-based judging methods. Our proposed approach is training-free, broadly applicable to various datasets, and maintains interpretability, offering a practical and effective solution for multi-attribute reward modeling.
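The entropy-guided composition described above can be illustrated with a short sketch. It assumes each safety rule produces discrete ratings (e.g. integer scores) over a dataset, measures each rule's rating entropy, and turns negative entropies into weights with a softmax. The softmax-with-temperature choice and all function names are hypothetical; the abstract only states that high-entropy rules are downweighted, so this is not ENCORE's exact formula.

```python
import math
from collections import Counter


def rating_entropy(ratings):
    """Shannon entropy (in nats) of the empirical distribution of discrete ratings."""
    counts = Counter(ratings)
    n = len(ratings)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def entropy_weights(ratings_per_rule, temperature=1.0):
    """Hypothetical weighting: softmax over negative entropies, so rules
    whose ratings are more spread out (higher entropy) get smaller weights."""
    entropies = [rating_entropy(r) for r in ratings_per_rule]
    logits = [-h / temperature for h in entropies]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def aggregate_reward(head_scores, weights):
    """Combine one response's per-rule (per-head) scores into a scalar reward."""
    return sum(w * s for w, s in zip(weights, head_scores))


# Hypothetical ratings of the same six responses under two rules.
ratings = [
    [5, 5, 5, 4, 5, 5],  # rule A: concentrated ratings, low entropy
    [1, 3, 5, 2, 4, 3],  # rule B: spread-out ratings, high entropy
]
w = entropy_weights(ratings)
print(w)  # rule A receives the larger weight
print(aggregate_reward([4.0, 2.0], w))
```

The softmax keeps every weight positive and makes them sum to one, and the temperature controls how sharply high-entropy rules are suppressed. Because no training is involved, the weights remain directly inspectable, in line with the training-free and interpretable framing above.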


ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning

arXiv.org Artificial Intelligence

Autonomous agents powered by foundation models have seen widespread adoption across various real-world applications. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of agents. To tackle these challenges, we propose ShieldAgent, the first guardrail agent designed to enforce explicit safety policy compliance for the action trajectory of other protected agents through logical reasoning. Specifically, ShieldAgent first constructs a safety policy model by extracting verifiable rules from policy documents and structuring them into a set of action-based probabilistic rule circuits. Given the action trajectory of the protected agent, ShieldAgent retrieves relevant rule circuits and generates a shielding plan, leveraging its comprehensive tool library and executable code for formal verification. In addition, given the lack of guardrail benchmarks for agents, we introduce ShieldAgent-Bench, a dataset with 3K safety-related pairs of agent instructions and action trajectories, collected via SOTA attacks across 6 web environments and 7 risk categories. Experiments show that ShieldAgent achieves SOTA on ShieldAgent-Bench and three existing benchmarks, outperforming prior methods by 11.3% on average with a high recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding agents.
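To make "policy compliance for an action trajectory" concrete, here is a toy, deterministic sketch: rules "extracted" from a policy are predicates over individual actions, the rules relevant to each action are retrieved by action type, and the trajectory is checked step by step. The rules, action schema and function names are all hypothetical. ShieldAgent itself uses probabilistic rule circuits, a tool library and formal verification, none of which this sketch reproduces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    rule_id: str
    action_type: str               # which actions this rule applies to
    description: str
    check: Callable[[dict], bool]  # returns True if the action complies


# Hypothetical rules, standing in for rules extracted from a policy document.
RULES = [
    Rule("no-external-payment", "submit_payment",
         "Payments may only go to allow-listed merchants",
         lambda a: a.get("merchant") in {"shop.example.com"}),
    Rule("no-pii-sharing", "fill_form",
         "Do not enter a social security number into web forms",
         lambda a: "ssn" not in a.get("fields", {})),
]


def retrieve_rules(action, rules):
    """Retrieve only the rules relevant to this action's type."""
    return [r for r in rules if r.action_type == action["type"]]


def shield(trajectory, rules=RULES):
    """Return (step index, rule id) for every rule violated along the trajectory."""
    violations = []
    for i, action in enumerate(trajectory):
        for rule in retrieve_rules(action, rules):
            if not rule.check(action):
                violations.append((i, rule.rule_id))
    return violations


trajectory = [
    {"type": "navigate", "url": "https://shop.example.com"},
    {"type": "fill_form", "fields": {"name": "Ada", "ssn": "000-00-0000"}},
    {"type": "submit_payment", "merchant": "evil.example.net", "amount": 500},
]
print(shield(trajectory))  # [(1, 'no-pii-sharing'), (2, 'no-external-payment')]
```

Retrieving rules per action keeps the number of checks proportional to the relevant rules rather than the whole policy, which loosely mirrors the retrieval step in the description above.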


Sociotechnical Effects of Machine Translation

arXiv.org Artificial Intelligence

While the previous chapters have shown how machine translation (MT) can be useful, in this chapter we discuss some of the side effects and risks associated with it, and how they might be mitigated. With the move to neural MT and approaches using Large Language Models (LLMs), there is an associated impact on climate change, as the models built by multinational corporations are massive. They are hugely expensive to train, consume large amounts of electricity, and emit large quantities of CO2 to boot. However, smaller models that still perform at a high level of quality can be built with much lower carbon footprints, and tuning pre-trained models avoids the need to train from scratch. We also discuss the possible detrimental effects of MT on translators and other users. The topics of copyright and ownership of data are discussed, as well as ethical considerations around data and MT use. Finally, we show how, if done properly, using MT in crisis scenarios can save lives, and we provide a method for doing so.


Synthetic Data Augmentation for Cross-domain Implicit Discourse Relation Recognition

arXiv.org Artificial Intelligence

Implicit discourse relation recognition (IDRR) -- the task of identifying the implicit coherence relation between two text spans -- requires deep semantic understanding. Recent studies have shown that zero- or few-shot approaches significantly lag behind supervised models, but LLMs may be useful for synthetic data augmentation, where LLMs generate a second argument following a specified coherence relation. We applied this approach in a cross-domain setting, generating discourse continuations using unlabelled target-domain data to adapt a base model which was trained on source-domain labelled data. Evaluations conducted on a large-scale test set revealed that different variations of the approach did not result in any significant improvements. We conclude that LLMs often fail to generate useful samples for IDRR, and emphasize the importance of considering both statistical significance and comparability when evaluating IDRR models.
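The abstract stresses statistical significance when comparing IDRR models. One common option, assumed here since the abstract does not name the test used, is a paired approximate-randomization (permutation) test on per-item correctness of two systems scored on the same test set. The function name and example data are hypothetical.

```python
import random


def paired_permutation_test(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Two-sided approximate-randomization test on the accuracy difference
    between two systems evaluated on the same test items (1 = correct, 0 = wrong)."""
    assert len(correct_a) == len(correct_b)
    n = len(correct_a)
    observed = abs(sum(correct_a) - sum(correct_b)) / n
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:  # randomly swap the two systems' outcomes on this item
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            extreme += 1
    # Add-one smoothing so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical per-item correctness for a base model and an augmented model.
base = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 20
aug = base[:]
aug[1] = aug[4] = 1  # the augmented model fixes only two items
p = paired_permutation_test(base, aug)
print(p)  # roughly 0.5: two flipped items out of 200 are no evidence of a real gain
```

Only items on which the two systems disagree can change the resampled difference, which is why a small accuracy gain on a large test set can still fail to be significant.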


Deep Learning for Forensic Identification of Source

arXiv.org Machine Learning

We used contrastive neural networks to learn useful similarity scores between the 144 cartridge casings in the NBIDE dataset, under the common-but-unknown source paradigm. The common-but-unknown source problem is a problem archetype in forensics where the question is whether two objects share a common source (e.g. were two cartridge casings fired from the same firearm). Similarity scores are often used to interpret evidence under this paradigm. We directly compared our results to a state-of-the-art algorithm, Congruent Matching Cells (CMC). When trained on the E3 dataset of 2967 cartridge casings, contrastive learning achieved an ROC AUC of 0.892. The CMC algorithm achieved 0.867. We also conducted an ablation study where we varied the neural network architecture; specifically, the network's width or depth. The ablation study showed that contrastive network performance results are somewhat robust to the network architecture. This work was in part motivated by the use of similarity scores attained via contrastive learning for standard evidence interpretation methods such as score-based likelihood ratios.
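The ROC AUC figures above summarize how well similarity scores separate same-source pairs from different-source pairs. As a minimal sketch, the AUC equals the probability that a randomly chosen same-source pair scores higher than a randomly chosen different-source pair, with ties counting one half (the Mann-Whitney formulation). The scores below are hypothetical and not taken from NBIDE or E3.

```python
from itertools import product


def roc_auc(same_source_scores, diff_source_scores):
    """ROC AUC as the probability that a same-source pair outscores a
    different-source pair (ties count 1/2): the Mann-Whitney formulation."""
    wins = 0.0
    for s, d in product(same_source_scores, diff_source_scores):
        if s > d:
            wins += 1.0
        elif s == d:
            wins += 0.5
    return wins / (len(same_source_scores) * len(diff_source_scores))


# Hypothetical similarity scores for pairs of cartridge casings.
same = [0.91, 0.85, 0.78, 0.60]        # pairs fired from the same firearm
diff = [0.40, 0.55, 0.62, 0.30, 0.20]  # pairs fired from different firearms
print(roc_auc(same, diff))  # 0.95: only 0.60 < 0.62 is misordered (19 of 20 pairs)
```

This pairwise version is quadratic in the number of scores; for large evaluations a rank-based computation or a library routine is the usual choice.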


Why the world is looking to ditch US AI models

MIT Technology Review

As a result, some policymakers and business leaders--in Europe, in particular--are reconsidering their reliance on US-based tech and asking whether they can quickly spin up better, homegrown alternatives. This is particularly true for AI. One of the clearest examples of this is in social media. Yasmin Curzi, a Brazilian law professor who researches domestic tech policy, put it to me this way: "Since Trump's second administration, we cannot count on [American social media platforms] to do even the bare minimum anymore." Social media content moderation systems--which already use automation and are also experimenting with deploying large language models to flag problematic posts--are failing to detect gender-based violence in places as varied as India, South Africa, and Brazil.


'No consent': Australian authors 'livid' that Meta may have used their books to train AI

The Guardian

Australian authors say they are "livid" and feel violated that their work was included in an allegedly pirated dataset of books Meta used to train its AI. In court filings in January it was alleged that chief executive Mark Zuckerberg approved the use of the LibGen dataset, an online archive of books, to train the company's artificial intelligence models despite warnings from his AI executive team that it is a dataset "we know to be pirated". The Atlantic has published a searchable database where authors can type in their name to see which of their works are included in the LibGen dataset. It includes books published by many Australian authors, including some by former prime ministers Malcolm Turnbull, Kevin Rudd, Julia Gillard and John Howard. Holden Sheppard, the author of Invisible Boys, a hit young adult novel that has been adapted into a series on Stan, said two of his books and two short stories were included.


Open Deep Search: Democratizing Search with Open-source Reasoning Agents

arXiv.org Artificial Intelligence

We introduce Open Deep Search (ODS) to close the increasing gap between the proprietary search AI solutions, such as Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview, and their open-source counterparts. The main innovation introduced in ODS is to augment the reasoning capabilities of the latest open-source LLMs with reasoning agents that can judiciously use web search tools to answer queries. Concretely, ODS consists of two components that work with a base LLM chosen by the user: Open Search Tool and Open Reasoning Agent. Open Reasoning Agent interprets the given task and completes it by orchestrating a sequence of actions that includes calling tools, one of which is the Open Search Tool. Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Together with powerful open-source reasoning LLMs, such as DeepSeek-R1, ODS nearly matches and sometimes surpasses the existing state-of-the-art baselines on two benchmarks: SimpleQA and FRAMES. For example, on the FRAMES evaluation benchmark, ODS improves on the best existing baseline, the recently released GPT-4o Search Preview, by 9.7% in accuracy. ODS is a general framework for seamlessly augmenting any LLM -- for example, DeepSeek-R1, which achieves 82.4% on SimpleQA and 30.1% on FRAMES -- with search and reasoning capabilities to achieve state-of-the-art performance: 88.3% on SimpleQA and 75.3% on FRAMES.
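The orchestration loop attributed to Open Reasoning Agent (interpret the task, call tools such as a search tool, then answer) can be sketched in a few lines. Everything here is hypothetical: the "LLM" is a scripted stub, the search tool is a dictionary lookup, and the `Action: tool[arg]` / `Final Answer:` text protocol is a ReAct-style convention assumed for illustration, not the ODS API.

```python
from typing import Callable


def fake_search(query: str) -> str:
    """Hypothetical stand-in for a web search tool."""
    corpus = {"capital of australia": "Canberra is the capital city of Australia."}
    return corpus.get(query.lower(), "No results.")


def scripted_llm(history: list[str]) -> str:
    """Stub policy: search once, then answer. A real agent would query an LLM here."""
    if not any(h.startswith("Observation:") for h in history):
        return "Action: search[capital of australia]"
    return "Final Answer: Canberra"


def run_agent(task: str, llm: Callable[[list[str]], str],
              tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Alternate model steps and tool calls until a final answer or the step budget."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm(history)
        history.append(step)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            # Parse "tool[argument]" and feed the tool's output back as an observation.
            name, _, arg = step.removeprefix("Action:").strip().partition("[")
            result = tools[name](arg.rstrip("]"))
            history.append(f"Observation: {result}")
    return "No answer within step budget"


answer = run_agent("What is the capital of Australia?", scripted_llm, {"search": fake_search})
print(answer)  # Canberra
```

Because the base model is passed in as a plain callable, swapping in a different LLM leaves the loop unchanged, which loosely mirrors the "base LLM chosen by the user" design described above.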


AI Identity, Empowerment, and Mindfulness in Mitigating Unethical AI Use

arXiv.org Artificial Intelligence

Emerging artificial intelligence (AI) technology has a pronounced impact on higher education, addressing existing challenges in educational settings such as larger school sizes and the scarcity of elite instructors. In all these areas, it has been noted that AI has led to massive changes: some estimates suggest that at least 80 percent of workers will have the quantity and quality of at least some of their tasks influenced (for the better) by AI (Canagasuriam & Lukacik, 2024). In educational contexts, psychological empowerment has been shown to mitigate the combined effects of emotional exhaustion and depression, demonstrating that social relationships and leadership can bolster mental health in institutions (Schermuly & Meyer, 2016). However, this is not to say that AI is without dangers; cybercriminals have also turned to AI to bolster their attacks, for example, in the form of spear phishing or malware installation, showcasing how AI can be abused as a tool to harm enterprises (Mirsky et al., 2023). Psychological empowerment, comprising meaning, competence, self-determination, and impact, has strong effects on person-environment interactions, which ultimately influence how individuals feel about and perform their jobs (Gregory et al., 2010).


Guarding against artificial intelligence--hallucinated citations: the case for full-text reference deposit

arXiv.org Artificial Intelligence

The tendency of generative artificial intelligence (AI) systems to "hallucinate" false information is well-known; AI-generated citations to nonexistent sources have made their way into the reference lists of peer-reviewed publications. Here, I propose a solution to this problem, taking inspiration from the Transparency and Openness Promotion (TOP) data sharing guidelines, the clash of generative AI with the American judiciary, and the precedent set by submissions of prior art to the United States Patent and Trademark Office. Journals should require authors to submit the full text of each cited source along with their manuscripts, thereby preventing authors from citing any material whose full text they cannot produce. This solution requires limited additional work on the part of authors or editors while effectively immunizing journals against hallucinated references. Within the same month, commenters on PubPeer raised concerns regarding the article's reference list.
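As a rough illustration of how an editorial office might enforce full-text deposit, the sketch below flags any reference whose full text was not submitted. It assumes, hypothetically, that each reference has a citation key and that deposited files are named `<key>.<extension>` in one directory; the proposal itself does not prescribe a file layout or tooling.

```python
import tempfile
from pathlib import Path


def missing_deposits(reference_keys, deposit_dir):
    """Return reference keys with no deposited full text (any file named <key>.*)."""
    deposited = {p.stem for p in Path(deposit_dir).iterdir() if p.is_file()}
    return sorted(k for k in reference_keys if k not in deposited)


# Hypothetical submission: three cited sources, only two full texts deposited.
with tempfile.TemporaryDirectory() as d:
    for name in ["smith2020.pdf", "lee2021.pdf"]:
        (Path(d) / name).write_bytes(b"%PDF-1.4 placeholder")
    refs = ["smith2020", "lee2021", "doe2023"]
    result = missing_deposits(refs, d)
    print(result)  # ['doe2023']
```

A check like this only confirms that a file exists for every citation; an editor would still need to verify that each file actually is the cited source.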