Industry
Unveiling the Uncertainty in Embodied and Operational Carbon of Large AIModels through a Probabilistic Carbon Accounting Model
The rapid growth of large AI models has raised significant environmental concerns due to their substantial carbon footprint. Existing carbon accounting methods for AI models are fundamentally deterministic and fail to account for inherent uncertainties in embodied and operational carbon emissions. Our work aims to investigate the effect of these uncertainties on embodied and operational carbon footprint estimates for large AI models. We propose a Probabilistic Carbon Accounting Model (PCAM), which quantifies uncertainties in the carbon accounting of large AI models. We develop parameter models to quantify key components (processors, memory, storage) in the carbon footprint of AI models. To characterize the distribution of the parameters, we develop a carbon dataset by aggregating related data from various sources. Then, we generate the probabilistic distribution of the parameters from the collected dataset. We compare the performance of PCAM with LLMCarbon, the state-of-the-art carbon accounting method for large AI models.
Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching
We introduce Time-Conditioned Contraction Matching (TCCM), a novel method for semi-supervised anomaly detection in tabular data. TCCM is inspired by flow matching, a recent generative modeling framework that learns velocity fields between probability distributions and has shown strong performance compared to diffusion models and generative adversarial networks. Instead of directly applying flow matching as originally formulated, TCCM builds on its core idea--learning velocity fields between distributions--but simplifies the framework by predicting a time-conditioned contraction vector toward a fixed target (the origin) at each sampled time step. This design offers three key advantages: (1) a lightweight and scalable training objective that removes the need for solving ordinary differential equations during training and inference; (2) an efficient scoring strategy called one time-step deviation, which quantifies deviation from expected contraction behavior in a single forward pass, addressing the inference bottleneck of existing continuous-time models such as DTE (a diffusion-based model with leading anomaly detection accuracy but heavy inference cost); and (3) explainability and provable robustness, as the learned velocity field operates directly in input space, making the anomaly score inherently feature-wise attributable; moreover, the score function is Lipschitz-continuous with respect to the input, providing theoretical guarantees under small perturbations. Extensive experiments on the ADBench benchmark show that TCCM strikes a favorable balance between detection accuracy and inference cost, outperforming state-of-the-art methods--especially on high-dimensional and large-scale datasets.
Generating Computational Cognitive Models using Large Language Models
Computational cognitive models, which formalize theories of cognition, enable researchers to quantify cognitive processes and arbitrate between competing theories by fitting models to behavioral data. Traditionally, these models are handcrafted, which requires significant domain knowledge, coding expertise, and time investment.
Exploring the limits of strong membership inference attacks on large language models
State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training references (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA--one of the strongest MIAs--to GPT-2 architectures ranging from 10M to 1B parameters, training references on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pretrained LLMs, (2) their effectiveness, remains limited (e.g., AUC<0.7) in practical settings.
Predicting Functional Brain Connectivity with Context-Aware Deep Neural Networks
Spatial location and molecular interactions have long been linked to the connectivity patterns of neural circuits. Yet, at the macroscale of human brain networks, the interplay between spatial position, gene expression, and connectivity remains incompletely understood. Recent efforts to map the human transcriptome and connectome have yielded spatially resolved brain atlases, however modeling the relationship between high-dimensional transcriptomic data and connectivity while accounting for inherent spatial confounds presents a significant challenge. In this paper, we present the first deep learning approaches for predicting whole-brain functional connectivity from gene expression and regional spatial coordinates, including our proposed Spatiomolecular Transformer (SMT). SMT explicitly models biological context by tokenizing genes based on their transcription start site (TSS) order to capture multi-scale genomic organization, and incorporating regional 3D spatial location via a dedicated context [CLS] token within its multi-head self-attention mechanism. We rigorously benchmark context-aware neural networks, including SMT and a single-gene resolution Multilayer-Perceptron (MLP), to established rules-based and bilinear methods. Crucially, to ensure that learned relationships in any model are not mere artifacts of spatial proximity, we introduce novel spatiomolecular null maps preserving key transcriptomic autocorrelation structure. Context-aware neural networks outperform linear methods, significantly exceed our stringent null map estimates, and generalize across diverse connectomic datasets and parcellation resolutions. Together, these findings demonstrate a strong, predictable link between the spatial distributions of gene expression and functional brain network architecture, and establish a rigorously validated deep learning framework for decoding this relationship.
Differentiable Hierarchical Visual Tokenization
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
Modeling Neural Activity with Conditionally Linear Dynamical Systems
Neural population activity exhibits complex, nonlinear dynamics, varying in time, over trials, and across experimental conditions. Here, we develop Conditionally Linear Dynamical System (CLDS) models as a general-purpose method to characterize these dynamics. These models use Gaussian Process (GP) priors to capture the nonlinear dependence of circuit dynamics on task and behavioral variables. Conditioned on these covariates, the data is modeled with linear dynamics. This allows for transparent interpretation and tractable Bayesian inference. We find that CLDS models can perform well even in severely data-limited regimes (e.g. one trial per condition) due to their Bayesian formulation and ability to share statistical power across nearby task conditions. In example applications, we apply CLDS to model thalamic neurons that nonlinearly encode heading direction and to model motor cortical neurons during a cued reaching task.
Direct Fisher Score Estimation for Likelihood Maximization
We study the problem of likelihood maximization when the likelihood function is intractable but model simulations are readily available. We propose a sequential, gradient-based optimization method that directly models the Fisher score based on a local score matching technique which uses simulations from a localized region around each parameter iterate. By employing a linear parameterization for the surrogate score model, our technique admits a closed-form, least-squares solution. This approach yields a fast, flexible, and efficient approximation to the Fisher score, effectively smoothing the likelihood objective and mitigating the challenges posed by complex likelihood landscapes. We provide theoretical guarantees for our score estimator, including bounds on the bias introduced by the smoothing. Empirical results on a range of synthetic and real-world problems demonstrate the superior performance of our method compared to existing benchmarks.
CXReasonBench: ABenchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 12 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation.