Goto

Collaborating Authors

 encoder


Removing Concepts from Text-to-Image Models with Only Negative Samples

Neural Information Processing Systems

This work introduces Clipout, a method for removing a target concept in pretrained text-to-image models. By randomly clipping units from the learned data embedding and using a contrastive objective, models are encouraged to differentiate these clipped embedding vectors.


Obliviator Reveals the Cost of Nonlinear Guardedness in Concept Erasure

Neural Information Processing Systems

Concept erasure aims to remove unwanted attributes, such as social or demographic factors, from learned representations, while preserving their task-relevant utility. While the goal of concept erasure is protection against all adversaries, existing methods remain vulnerable to nonlinear ones. This vulnerability arises from their failure to fully capture the complex, nonlinear statistical dependencies between learned representations and unwanted attributes. Moreover, although the existence of a trade-off between utility and erasure is expected, its progression during the erasure process, i.e., the cost of erasure, remains unstudied. In this work, we introduce Obliviator, a post-hoc erasure method designed to fully capture nonlinear statistical dependencies.


scGeneScope: ATreatment-Matched Single Cell Imaging and Transcriptomics Dataset and Benchmark for Treatment Response Modeling

Neural Information Processing Systems

Understanding cellular responses to chemical interventions is critical to the discovery of effective therapeutics. Because individual biological techniques often measure only one axis of cellular response at a time, high-quality multimodal datasets are needed to unlock a holistic understanding of how cells respond to treatments and to advance computational methods that integrate modalities. However, many techniques destroy cells and thus preclude paired measurements, and attempts to match disparate unimodal datasets are often confounded by data being generated in incompatible experimental settings. Here we introduce scGeneScope, a multimodal single-cell RNA sequencing (scRNA-seq) and Cell Painting microscopy image dataset conditionally paired by chemical treatment, designed to facilitate the development and benchmarking of unimodal, multimodal, and multiple profile machine learning methods for cellular profiling.


TimeXL: Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop

Neural Information Processing Systems

Time series analysis provides essential insights for real-world system dynamics and informs downstream decision-making, yet most existing methods often overlook the rich contextual signals present in auxiliary modalities. To bridge this gap, we introduce TimeXL, a multi-modal prediction framework that integrates a prototypebased time series encoder with three collaborating Large Language Models (LLMs) to deliver more accurate predictions and interpretable explanations. First, a multimodal prototype-based encoder processes both time series and textual inputs to generate preliminary forecasts alongside case-based rationales. These outputs then feed into a prediction LLM, which refines the forecasts by reasoning over the encoder's predictions and explanations. Next, a reflection LLM compares the predicted values against the ground truth, identifying textual inconsistencies or noise. Guided by this feedback, a refinement LLM iteratively enhances text quality and triggers encoder retraining. This closed-loop workflow--prediction, critique (reflect), and refinement--continuously boosts the framework's performance and interpretability. Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9% improvement in AUC and produces human-centric, multi-modal explanations, highlighting the power of LLM-driven reasoning for time series prediction.


Self-supervised Blending Structural Context of Visual Molecules for Robust Drug Interaction Prediction

Neural Information Processing Systems

Identifying drug-drug interactions (DDIs) is critical for ensuring drug safety and advancing drug development, a topic that has garnered significant research interest. While existing methods have made considerable progress, approaches relying solely on known DDIs face a key challenge when applied to drugs with limited data (e.g., novel and few-shot drugs): insufficient exploration of the space of unlabeled pairwise drugs. To address these issues, we innovatively introduce S2VM, a Selfsupervised Visual pretraining framework for pair-wise Molecules, to fully fuse structural representations and explore the space of drug pairs for DDI prediction. S2VM incorporates the explicit structure and correlations of visual molecules, such as the positional relationships and connectivity between functional substructures. Specifically, we blend the visual fragments of drug pairs into a unified input for joint encoding and then recover molecule-specific visual information for each drug individually.


1 Appendix 2 AMore Details

Neural Information Processing Systems

Score 0 4 (normal) is most common across cohorts, while score 3 (severe) is rare--especially in PD-GaM 5 and 3DGait, highlighting class imbalance challenges. BMCLab offers a balanced ON/OFF medication split, 7 while E-LC is skewed toward ON-medication. DNE includes healthy, Parkinsonian, and other disease 8 groups for broader contrastive training. Figure A.3 shows label distributions for FoG-related cohorts. This artifact likely stems from the unusual top-down perspective--different from the front15 facing or side views seen in WHAM's training data [1]. While motion encoder-based models may be 16 robust to such distortions, feature-based gait classifiers rely on precise kinematic measurements and 17 thus require carefully corrected input data. To correct this slope artifact, we perform a frame-wise 18 rigid alignment of the reconstructed SMPL skeleton using the Kabsch algorithm [2]. The goal is to 19 rotate each frame so that anatomical directions align with canonical coordinate axes (up, forward), 20 while preserving natural gait structure. This motion 28 vector is then projected onto the ground plane (xz-plane) and used as the walking axis. In frames where the sacrum displacement is less than 30 4mm--indicating near-stationary posture--we fall back on a proxy direction: the cross product of 31 the hip vector (left hip to right hip) and the vertical vector.


CARE-PD: AMulti-Site Anonymized Clinical Dataset for Parkinson's Disease Gait Assessment

Neural Information Processing Systems

Objective gait assessment in Parkinson's Disease (PD) is limited by the absence of large, diverse, and clinically annotated motion datasets. We introduce CARE-PD, the largest publicly available archive of 3D mesh gait data for PD, and the first multi-site collection spanning 9 cohorts from 8 clinical centers. All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. CARE-PD supports two key benchmarks: supervised clinical score prediction (estimating Unified Parkinson's Disease Rating Scale, UPDRS, gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction). Clinical prediction is evaluated under four generalization protocols: within-dataset, cross-dataset, leave-one-dataset-out, and multi-dataset in-domain adaptation. To assess clinical relevance, we compare state-of-the-art motion encoders with a traditional gait-feature baseline, finding that encoders consistently outperform handcrafted features. Pretraining on CARE-PD reduces MPJPE (from 60.8 mm to 7.5 mm) and boosts PD severity macro-F1 by 17 percentage points, underscoring the value of clinically curated, diverse training data. CARE-PD and all benchmark code are released for non-commercial research at https://neurips2025.care-pd.ca.


ImgEdit: AUnified Image Editing Dataset and Benchmark

Neural Information Processing Systems

Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a largescale, high-quality image-editing dataset comprising one million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, a segmentation model, alongside task-specific in-painting procedures and strict postprocessing.


AStatistical Theory of Contrastive Learning via Approximate Sufficient Statistics

Neural Information Processing Systems

Contrastive learning--a modern approach to extract useful representations from unlabeled data by training models to distinguish similar samples from dissimilar ones--has driven significant progress in foundation models. In this work, we develop a new theoretical framework for analyzing data augmentation-based contrastive learning, with a focus on SimCLR as a representative example. Our approach is based on the concept of approximate sufficient statistics, which we extend beyond its original definition in Oko et al. [28] for contrastive languageimage pretraining (CLIP) using KL-divergence. We generalize it to equivalent forms and general f-divergences, and show that minimizing SimCLR and other contrastive losses yields encoders that are approximately sufficient. Furthermore, we demonstrate that these near-sufficient encoders can be effectively adapted to downstream regression and classification tasks, with performance depending on their sufficiency and the error induced by data augmentation in contrastive learning. Concrete examples in linear regression and topic classification are provided to illustrate the broad applicability of our results.


CellCLIP - Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning

Neural Information Processing Systems

High-content screening (HCS) assays based on high-throughput microscopy techniques such as Cell Painting have enabled the interrogation of cells' morphological responses to perturbations at an unprecedented scale. The collection of such data promises to facilitate a better understanding of the relationships between different perturbations and their effects on cellular state. Towards achieving this goal, recent advances in cross-modal contrastive learning could, in theory, be leveraged to learn a unified latent space that aligns perturbations with their corresponding morphological effects. However, the application of such methods to HCS data is not straightforward due to substantial differences in the semantics of Cell Painting images compared to natural images, and the difficulty of representing different classes of perturbations (e.g.