Law
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Knight, Christina Q., Deshpande, Kaustubh, Sirdeshmukh, Ved, Mankikar, Meher, Team, Scale Red, Team, SEAL Research, Michael, Julian
The rapid advancement of large language models (LLMs) introduces dual-use capabilities that could both threaten and bolster national security and public safety (NSPS). Models implement safeguards to protect against potential misuse relevant to NSPS and allow for benign users to receive helpful information. However, current benchmarks often fail to test safeguard robustness to potential NSPS risks in an objective, robust way. We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains. Each prompt-rubric pair has a corresponding benign version to test for model over-refusals. This evaluation of frontier LLMs' safeguard robustness reveals varying trade-offs between potential risks and model usefulness: Claude-3.5-Sonnet demonstrates a low average risk score (ARS) (14.09 out of 100) but the highest over-refusal score (ORS) (21.8 out of 100), while Gemini 2.5 Pro shows low over-refusal (1.4) but a high average potential risk (66.29). Deepseek-R1 has the highest ARS at 78.05, but the lowest ORS at only 0.06. Models such as o1 display a more even trade-off between potential risks and over-refusals (with an ARS of 21.69 and ORS of 5.2). To provide policymakers and researchers with a clear understanding of models' potential risks, we publicly release FORTRESS at https://huggingface.co/datasets/ScaleAI/fortress_public. We also maintain a private set for evaluation.
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
Tang, Xiaqiang, Li, Jian, Hu, Keyu, Nan, Du, Li, Xiaolong, Zhang, Xi, Sun, Weigao, Xie, Sihong
Faithfulness hallucinations are claims generated by a Large Language Model (LLM) not supported by contexts provided to the LLM. Lacking assessment standards, existing benchmarks focus on "factual statements" that rephrase source materials while overlooking "cognitive statements" that involve making inferences from the given context. Consequently, evaluating and detecting the hallucination of cognitive statements remains challenging. Inspired by how evidence is assessed in the legal domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and introduce the CogniBench dataset where we reveal insightful statistics. To keep pace with rapidly evolving LLMs, we further develop an automatic annotation pipeline that scales easily across different models. This results in a large-scale CogniBench-L dataset, which facilitates training accurate detectors for both factual and cognitive hallucinations. We release our model and datasets at: https://github.com/FUTUREEEEEE/CogniBench
Large Language Model-Driven Code Compliance Checking in Building Information Modeling
Madireddy, Soumya, Gao, Lu, Din, Zia, Kim, Kinam, Senouci, Ahmed, Han, Zhe, Zhang, Yunpeng
This research addresses the time-consuming and error-prone nature of manual code compliance checking in Building Information Modeling (BIM) by introducing a Large Language Model (LLM)-driven approach to semi-automate this critical process. The developed system integrates LLMs such as GPT, Claude, Gemini, and Llama, with Revit software to interpret building codes, generate Python scripts, and perform semi-automated compliance checks within the BIM environment. Case studies on a single-family residential project and an office building project demonstrated the system's ability to reduce the time and effort required for compliance checks while improving accuracy. It streamlined the identification of violations, such as non-compliant room dimensions, material usage, and object placements, by automatically assessing relationships and generating actionable reports. Compared to manual methods, the system eliminated repetitive tasks, simplified complex regulations, and ensured reliable adherence to standards. By offering a comprehensive, adaptable, and cost-effective solution, this proposed approach offers a promising advancement in BIM-based compliance checking, with potential applications across diverse regulatory documents in construction projects.
Probing AI Safety with Source Code
Narayan, Ujwal, Chaudhari, Shreyas, Kalyan, Ashwin, Rajpurohit, Tanmay, Narasimhan, Karthik, Deshpande, Ameet, Murahari, Vishvak
Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with human values and preferences. In this work, we demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users. We introduce a prompting strategy called Code of Thought (CoDoT) to evaluate the safety of LLMs. CoDoT converts natural language inputs to simple code that represents the same intent. For instance, CoDoT transforms the natural language prompt "Make the statement more toxic: {text}" to: "make_more_toxic({text})". We show that CoDoT results in a consistent failure of a wide range of state-of-the-art LLMs. For example, GPT-4 Turbo's toxicity increases 16.5 times, DeepSeek R1 fails 100% of the time, and toxicity increases 300% on average across seven modern LLMs. Additionally, recursively applying CoDoT can further increase toxicity two times. Given the rapid and widespread adoption of LLMs, CoDoT underscores the critical need to evaluate safety efforts from first principles, ensuring that safety and capabilities advance together.
Verifiable Unlearning on Edge
Maheri, Mohammad M, Davidson, Alex, Haddadi, Hamed
Machine learning providers commonly distribute global models to edge devices, which subsequently personalize these models using local data. However, issues such as copyright infringements, biases, or regulatory requirements may require the verifiable removal of certain data samples across all edge devices. Ensuring that edge devices correctly execute such unlearning operations is critical to maintaining integrity. In this work, we introduce a verification framework leveraging zero-knowledge proofs, specifically zk-SNARKs, to confirm data unlearning on personalized edge-device models without compromising privacy. We have developed algorithms explicitly designed to facilitate unlearning operations that are compatible with efficient zk-SNARK proof generation, ensuring minimal computational and memory overhead suitable for constrained edge environments. Furthermore, our approach carefully preserves personalized enhancements on edge devices, maintaining model performance post-unlearning. Our results affirm the practicality and effectiveness of this verification framework, demonstrating verifiable unlearning with minimal degradation in personalization-induced performance improvements. Our methodology ensures verifiable, privacy-preserving, and effective machine unlearning across edge devices.
The End of Publishing as We Know It
When tech companies first rolled out generative-AI products, some critics immediately feared a media collapse. Every bit of writing, imagery, and video became suspect. But for news publishers and journalists, another calamity was on the horizon. Chatbots have proved adept at keeping users locked into conversations. They do so by answering every question, often through summarizing articles from news publishers.
Why AI Regulation Has Become a 'States' Rights' Issue
Regardless of the outcome, the battle reflects both the AI industry's influence in Washington and the heightened anxieties that influence is causing among many different coalitions. Here's how the battle lines are being drawn, and why key Republicans are defecting from the party line. Congress has been notoriously slow to pass any sort of tech regulation in the past two decades. As a result, states have filled the void, passing bills that regulate biometric data and child online safety. The same has held true for AI: as the industry has surged in usage, hundreds of AI bills have been proposed in states, with dozens enacted into law.
'Big, beautiful' bill could give a free pass for Big Tech to kill jobs
Gladstone A.I. co-founders and CEOs Edouard Harris and Jeremie Harris explain the major role that A.I will play in national security and warfare on'The Will Cain Show.' Buried in the budget reconciliation package recently passed by the House is a moratorium that would block every U.S. state from passing laws on artificial intelligence or automation for the next decade. Why would lawmakers try to sneak a 10-year ban on AI regulation into a budget bill? The draft moratorium is the result of aggressive lobbying from companies that are already using AI to undermine workers and eliminate jobs. To understand what a ban on AI regulation could mean, ask yourself: what are lawmakers talking about when they talk about "AI?" Most of us imagine programs like ChatGPT churning out text and images. But Big Tech sees something else: disruption, control, profit. They want driverless trucks crisscrossing our roads without oversight.
GradualDiff-Fed: A Federated Learning Specialized Framework for Large Language Model
The rapid proliferation of large language models (LLMs) has created an unprecedented demand for fine-tuning models for specialized domains, such as medical science. While federated learning (FL) offers a decentralized and privacy-preserving approach to collaboratively fine-tune LLMs without sharing raw data, it presents significant challenges, particularly in performance and managing large model sizes efficiently. In this paper, we introduce GradualDiff-Fed, an FL framework designed explicitly for LLMs, and their challenge of handling the high parameter size. GradualDiff-Fed reduces communication costs by transmitting only the difference of model weights rather than the entire model during training rounds. Such an approach significantly improves scalability and communication efficiency, making it more feasible to fine-tune LLMs across distributed clients without compromising performance. Our evaluation demonstrates that GradualDiff-Fed achieves performance on par with centralized training while drastically reducing communication overhead. These results highlight the potential of GradualDiff-Fed as an efficient solution for fine-tuning large models from distributed data in privacy-preserving settings without comprising performance.