Law
Security-First AI: Foundations for Robust and Trustworthy Systems
The conversation around artificial intelligence (AI) often focuses on safety, transparency, accountability, alignment, and responsibility. However, AI security (i.e., the safeguarding of data, models, and pipelines from adversarial manipulation) underpins all of these efforts. This manuscript posits that AI security must be prioritized as a foundational layer. We present a hierarchical view of AI challenges, distinguishing security from safety, and argue for a security-first approach to enable trustworthy and resilient AI systems. We discuss core threat models, key attack vectors, and emerging defense mechanisms, concluding that a metric-driven approach to AI security is essential for robust AI safety, transparency, and accountability.
Certified Mitigation of Worst-Case LLM Copyright Infringement
Zhang, Jingyu, Yu, Jiacan, Marone, Marc, Van Durme, Benjamin, Khashabi, Daniel
The exposure of large language models (LLMs) to copyrighted material during pre-training raises concerns about unintentional copyright infringement post deployment. This has driven the development of "copyright takedown" methods, post-training approaches aimed at preventing models from generating content substantially similar to copyrighted ones. While current mitigation approaches are somewhat effective for average-case risks, we demonstrate that they overlook worst-case copyright risks exhibits by the existence of long, verbatim quotes from copyrighted sources. We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown. Our method repeatedly interleaves quote detection with rewriting techniques to transform potentially infringing segments. By leveraging efficient data sketches (Bloom filters), our approach enables scalable copyright screening even for large-scale real-world corpora. When quotes beyond a length threshold cannot be removed, the system can abstain from responding, offering certified risk reduction. Experimental results show that BloomScrub reduces infringement risk, preserves utility, and accommodates different levels of enforcement stringency with adaptive abstention. Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
aiXamine: Simplified LLM Safety and Security
Deniz, Fatih, Popovic, Dorde, Boshmaf, Yazan, Jeong, Euisuh, Ahmad, Minhaj, Chawla, Sanjay, Khalil, Issa
Evaluating Large Language Models (LLMs) for safety and security remains a complex task, often requiring users to navigate a fragmented landscape of ad hoc benchmarks, datasets, metrics, and reporting formats. To address this challenge, we present aiXamine, a comprehensive black-box evaluation platform for LLM safety and security. aiXamine integrates over 40 tests (i.e., benchmarks) organized into eight key services targeting specific dimensions of safety and security: adversarial robustness, code security, fairness and bias, hallucination, model and data privacy, out-of-distribution (OOD) robustness, over-refusal, and safety alignment. The platform aggregates the evaluation results into a single detailed report per model, providing a detailed breakdown of model performance, test examples, and rich visualizations. We used aiXamine to assess over 50 publicly available and proprietary LLMs, conducting over 2K examinations. Our findings reveal notable vulnerabilities in leading models, including susceptibility to adversarial attacks in OpenAI's GPT-4o, biased outputs in xAI's Grok-3, and privacy weaknesses in Google's Gemini 2.0. Additionally, we observe that open-source models can match or exceed proprietary models in specific services such as safety alignment, fairness and bias, and OOD robustness. Finally, we identify trade-offs between distillation strategies, model size, training methods, and architectural choices.
Sustainability via LLM Right-sizing
Haase, Jennifer, Klessascheck, Finn, Mendling, Jan, Pokutta, Sebastian
Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.
Google paid Samsung to preload and integrate Gemini AI on phones
If you're using an Android phone, you've probably noticed that Google's Gemini AI assistant seems to be popping up everywhere, the same way it's been popping into Google Search, Docs, YouTube, etc. And this is true even if you aren't using a Google-branded phone. Turns out, that's no accident because Google is paying Samsung loads of money to make sure Gemini is front and center on its phones. The information comes from a predictable source: testimony in the ongoing and potentially disastrous Google antitrust case. Google has lost two separate antitrust cases brought by the US federal government in the last year.)
ChatGPT maker OpenAI wants to buy Chrome from Google
Google is having a bit of a moment. It's not quite an Enron- or FTX-style "abandon ship" situation, but between two separate US antitrust rulings on its core search and advertising businesses, it's a five-alarm fire. One of the possible outcomes is Google selling off the Chrome browser… and it looks like one possible buyer is OpenAI, maker of ChatGPT. OpenAI's head of product for ChatGPT is named Nick Turley, and he testified at the remedy phase of the Department of Justice's successful monopoly suit against Google. When asked if OpenAI would be interested in buying the Chrome browser from Google, Turley didn't mince words.
State Bar of California admits it used AI to develop exam questions, triggering new furor
Nearly two months after hundreds of prospective California lawyers complained that their bar exams were plagued with technical problems and irregularities, the state's legal licensing body has caused fresh outrage by admitting that some multiple-choice questions were developed with the aid of artificial intelligence. The State Bar of California said in a news release Monday that it will ask the California Supreme Court to adjust test scores for those who took its February bar exam. But it declined to acknowledge significant problems with its multiple-choice questions -- even as it revealed that a subset of questions were recycled from a first-year law student exam, while others were developed with the assistance of AI by ACS Ventures, the State Bar's independent psychometrician. "The debacle that was the February 2025 bar exam is worse than we imagined," said Mary Basick, assistant dean of academic skills at UC Irvine Law School. Having the questions drafted by non-lawyers using ...
Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching
Peng, Kewen, Yang, Yicheng, Zhuo, Hao, Menzies, Tim
Fairness-aware learning aims to mitigate discrimination against specific protected social groups (e.g., those categorized by gender, ethnicity, age) while minimizing predictive performance loss. Despite efforts to improve fairness in machine learning, prior studies have shown that many models remain unfair when measured against various fairness metrics. In this paper, we examine whether the way training and testing data are sampled affects the reliability of reported fairness metrics. Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data, potentially skewing fairness assessments. To address this, we propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias. FairMatch identifies control and treatment pairs with similar propensity scores in the test set and adjusts decision thresholds for different subgroups accordingly. For samples that cannot be matched, we perform probabilistic calibration using fairness-aware loss functions. Experimental results demonstrate that our approach can (a) precisely locate subsets of the test data where the model is unbiased, and (b) significantly reduce bias on the remaining data. Overall, propensity score matching offers a principled way to improve both fairness evaluation and mitigation, without sacrificing predictive performance.
FinTextSim: Enhancing Financial Text Analysis with BERTopic
Jehnen, Simon, Ordieres-Meré, Joaquín, Villalba-Díez, Javier
Recent advancements in information availability and computational capabilities have transformed the analysis of annual reports, integrating traditional financial metrics with insights from textual data. To extract valuable insights from this wealth of textual data, automated review processes, such as topic modeling, are crucial. This study examines the effectiveness of BERTopic, a state-of-the-art topic model relying on contextual embeddings, for analyzing Item 7 and Item 7A of 10-K filings from S&P 500 companies (2016-2022). Moreover, we introduce FinTextSim, a finetuned sentence-transformer model optimized for clustering and semantic search in financial contexts. Compared to all-MiniLM-L6-v2, the most widely used sentence-transformer, FinTextSim increases intratopic similarity by 81% and reduces intertopic similarity by 100%, significantly enhancing organizational clarity. We assess BERTopic's performance using embeddings from both FinTextSim and all-MiniLM-L6-v2. Our findings reveal that BERTopic only forms clear and distinct economic topic clusters when paired with FinTextSim's embeddings. Without FinTextSim, BERTopic struggles with misclassification and overlapping topics. Thus, FinTextSim is pivotal for advancing financial text analysis. FinTextSim's enhanced contextual embeddings, tailored for the financial domain, elevate the quality of future research and financial information. This improved quality of financial information will enable stakeholders to gain a competitive advantage, streamlining resource allocation and decision-making processes. Moreover, the improved insights have the potential to leverage business valuation and stock price prediction models.
Interpretable Deep Learning for Polar Mechanistic Reaction Prediction
Miller, Ryan J., Dashuta, Alexander E., Rudisill, Brayden, Van Vranken, David, Baldi, Pierre
Accurately predicting chemical reactions is essential for driving innovation in synthetic chemistry, with broad applications in medicine, manufacturing, and agriculture. At the same time, reaction prediction is a complex problem which can be both time-consuming and resource-intensive for chemists to solve. Deep learning methods offer an appealing solution by enabling high-throughput reaction prediction. However, many existing models are trained on the US Patent Office dataset and treat reactions as overall transformations: mapping reactants directly to products with limited interpretability or mechanistic insight. To address this, we introduce PMechRP (Polar Mechanistic Reaction Predictor), a system that trains machine learning models on the PMechDB dataset, which represents reactions as polar elementary steps that capture electron flow and mechanistic detail. To further expand model coverage and improve generalization, we augment PMechDB with a diverse set of combinatorially generated reactions. We train and compare a range of machine learning models, including transformer-based, graph-based, and two-step siamese architectures. Our best-performing approach was a hybrid model, which combines a 5-ensemble of Chemformer models with a two-step Siamese framework to leverage the accuracy of transformer architectures, while filtering away "alchemical" products using the two-step network predictions. For evaluation, we use a test split of the PMechDB dataset and additionally curate a human benchmark dataset consisting of complete mechanistic pathways extracted from an organic chemistry textbook. Our hybrid model achieves a top-10 accuracy of 94.9% on the PMechDB test set and a target recovery rate of 84.9% on the pathway dataset.