Goto

Collaborating Authors

 Generative AI


Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization

arXiv.org Artificial Intelligence

As large language models (LLMs) continue to advance in capabilities, it is essential to assess how they perform on established benchmarks. In this study, we present a suite of experiments to assess the performance of modern LLMs (ranging from 3B to 90B+ parameters) on CaseHOLD, a legal benchmark dataset for identifying case holdings. Our experiments demonstrate scaling effects - performance on this task improves with model size, with more capable models like GPT4o and AmazonNovaPro achieving macro F1 scores of 0.744 and 0.720 respectively. These scores are competitive with the best published results on this dataset, and do not require any technically sophisticated model training, fine-tuning or few-shot prompting. To ensure that these strong results are not due to memorization of judicial opinions contained in the training data, we develop and utilize a novel citation anonymization test that preserves semantic meaning while ensuring case names and citations are fictitious. Models maintain strong performance under these conditions (macro F1 of 0.728), suggesting the performance is not due to rote memorization. These findings demonstrate both the promise and current limitations of LLMs for legal tasks with important implications for the development and measurement of automated legal analytics and legal benchmarks.


Single-agent or Multi-agent Systems? Why Not Both?

arXiv.org Artificial Intelligence

Multi-agent systems (MAS) decompose complex tasks and delegate subtasks to different large language model (LLM) agents and tools. Prior studies have reported the superior accuracy performance of MAS across diverse domains, enabled by long-horizon context tracking and error correction through role-specific agents. However, the design and deployment of MAS incur higher complexity and runtime cost compared to single-agent systems (SAS). Meanwhile, frontier LLMs, such as OpenAI-o3 and Gemini-2.5-Pro, have rapidly advanced in long-context reasoning, memory retention, and tool usage, mitigating many limitations that originally motivated MAS designs. In this paper, we conduct an extensive empirical study comparing MAS and SAS across various popular agentic applications. We find that the benefits of MAS over SAS diminish as LLM capabilities improve, and we propose efficient mechanisms to pinpoint the error-prone agent in MAS. Furthermore, the performance discrepancy between MAS and SAS motivates our design of a hybrid agentic paradigm, request cascading between MAS and SAS, to improve both efficiency and capability. Our design improves accuracy by 1.1-12% while reducing deployment costs by up to 20% across various agentic applications.


Towards medical AI misalignment: a preliminary study

arXiv.org Artificial Intelligence

--Despite their staggering capabilities as assistant tools, often exceeding human performances, Large Language Models (LLMs) are still prone to jailbreak attempts from malevolent users. Although red teaming practices have already identified and helped to address several such jailbreak techniques, one particular sturdy approach involving role-playing (which we named'Goofy Game') seems effective against most of the current LLMs safeguards. This can result in the provision of unsafe content, which, although not harmful per se, might lead to dangerous consequences if delivered in a setting such as the medical domain. In this preliminary and exploratory study, we provide an initial analysis of how, even without technical knowledge of the internal architecture and parameters of generative AI models, a malicious user could construct a role-playing prompt capable of coercing an LLM into producing incorrect (and potentially harmful) clinical suggestions. We aim to illustrate a specific vulnerability scenario, providing insights that can support future advancements in the field. Warning: this paper contains examples with unsafe content.


A Geometric View of Data Complexity: Efficient Local Intrinsic Dimension Estimation with Diffusion Models

Neural Information Processing Systems

High-dimensional data commonly lies on low-dimensional submanifolds, and estimating the local intrinsic dimension (LID) of a datum -- i.e. the dimension of the submanifold it belongs to -- is a longstanding problem. LID can be understood as the number of local factors of variation: the more factors of variation a datum has, the more complex it tends to be. Estimating this quantity has proven useful in contexts ranging from generalization in neural networks to detection of out-of-distribution data, adversarial examples, and AI-generated text. The recent successes of deep generative models present an opportunity to leverage them for LID estimation, but current methods based on generative models produce inaccurate estimates, require more than a single pre-trained model, are computationally intensive, or do not exploit the best available deep generative models: diffusion models (DMs). In this work, we show that the Fokker-Planck equation associated with a DM can provide an LID estimator which addresses the aforementioned deficiencies.


Barking up the right tree: an approach to search over molecule synthesis DAGs

Neural Information Processing Systems

When designing new molecules with particular properties, it is not only important what to make but crucially how to make it. These instructions form a synthesis directed acyclic graph (DAG), describing how a large vocabulary of simple building blocks can be recursively combined through chemical reactions to create more complicated molecules of interest. In contrast, many current deep generative models for molecules ignore synthesizability. We therefore propose a deep generative model that better represents the real world process, by directly outputting molecule synthesis DAGs. We argue that this provides sensible inductive biases, ensuring that our model searches over the same chemical space that chemists would also have access to, as well as interoperability.


Estimating the Hallucination Rate of Generative AI

Neural Information Processing Systems

This paper presents a method for estimating the hallucination rate for in-context learning (ICL) with generative AI. In ICL, a conditional generative model (CGM) is prompted with a dataset and a prediction question and asked to generate a response. One interpretation of ICL assumes that the CGM computes the posterior predictive of an unknown Bayesian model, which implicitly defines a joint distribution over observable datasets and latent mechanisms. This joint distribution factorizes into two components: the model prior over mechanisms and the model likelihood of datasets given a mechanism. With this perspective, we define a \textit{hallucination} as a generated response to the prediction question with low model likelihood given the mechanism.


CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Neural Information Processing Systems

Data selection has emerged as a core issue for large-scale visual-language model pretaining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce \textbf{negCLIPLoss}, a method inspired by CLIP training loss that adds the alignment between one sample and its contrastive pairs as an extra normalization term to CLIPScore for better quality measurement.


Invisible Image Watermarks Are Provably Removable Using Generative AI

Neural Information Processing Systems

They also prevent people from misusing images, especially those generated by AI models.We propose a family of regeneration attacks to remove these invisible watermarks. The proposed attack method first adds random noise to an image to destroy the watermark and then reconstructs the image. This approach is flexible and can be instantiated with many existing image-denoising algorithms and pre-trained generative models such as diffusion models. Through formal proofs and extensive empirical evaluations, we demonstrate that pixel-level invisible watermarks are vulnerable to this regeneration attack.Our results reveal that, across four different pixel-level watermarking schemes, the proposed method consistently achieves superior performance compared to existing attack techniques, with lower detection rates and higher image quality.However, watermarks that keep the image semantically similar can be an alternative defense against our attacks.Our finding underscores the need for a shift in research/industry emphasis from invisible watermarks to semantic-preserving watermarks.


On the Identifiability of Hybrid Deep Generative Models: Meta-Learning as a Solution

Neural Information Processing Systems

The interest in leveraging physics-based inductive bias in deep learning has resulted in recent development of hybrid deep generative models (hybrid-DGMs) that integrates known physics-based mathematical expressions in neural generative models. To identify these hybrid-DGMs requires inferring parameters of the physics-based component along with their neural component. The identifiability of these hybrid-DGMs, however, has not yet been theoretically probed or established. How does the existing theory of the un-identifiability of general DGMs apply to hybrid-DGMs? What may be an effective approach to consutrct a hybrid-DGM with theoretically-proven identifiability?


OpenAI's 6.5B new acquisition signals Apple's biggest AI crisis yet

FOX News

OpenAI CEO Sam Altman sits down with Shannon Bream to discuss the positives and potential negatives of artificial intelligence and the importance of maintaining a lead in the AI industry over China. OpenAI has just made a move that's turning heads across the tech world. The company is acquiring io, the AI device startup founded by Jony Ive, for nearly 6.5 billion. It's a collaboration between Sam Altman, who leads OpenAI, and the designer responsible for some of Apple's most iconic products, including the iPhone and Apple Watch. Together, they want to create a new generation of AI-powered devices that could completely change how we use technology.