Goto

Collaborating Authors

 Technology


Discovering Opinion Intervals from Conflicts in Signed Graphs

Neural Information Processing Systems

Online social media provide a platform for people to discuss current events and exchange opinions with their peers. While interactions are predominantly positive, in recent years, there has been a lot of research to understand the conflicts in social networks and how they are based on different views and opinions. In this paper, we ask whether the conflicts in a network reveal a small and interpretable set of prevalent opinion ranges that explain the users' interactions. More precisely, we consider signed graphs, where the edge signs indicate positive and negative interactions of node pairs, and our goal is to infer opinion intervals that are consistent with the edge signs. We introduce an optimization problem that models this question, and we give strong hardness results and a polynomial-time approximation scheme by utilizing connections to interval graphs and the Correlation Clustering problem. We further provide scalable heuristics and show that in experiments they yield more expressive solutions than Correlation Clustering baselines. We also present a case study on a novel real-world dataset from the German parliament, showing that our algorithms can recover the political leaning of German parties based on co-voting behavior.


Higher-Order Learning with Graph Neural Networks via Hypergraph Encodings

Neural Information Processing Systems

Higher-order information is crucial for relational learning in many domains where relationships extend beyond pairwise interactions. Hypergraphs provide a natural framework for modeling such relationships, which has motivated recent extensions of graph neural network (GNN) architectures to hypergraphs. Most of these architectures rely on message-passing to encode higher-order information. In this paper, we propose to instead use hypergraph-level encodings based on characteristics such as hypergraph Laplacians and discrete curvature notions. These encodings can be used on datasets that are naturally parametrized as hypergraphs and on graph-level datasets, which we reparametrize as hypergraphs to compute encodings.


Joint Hierarchical Representation Learning of Samples and Features via Informed Tree-Wasserstein Distance

Neural Information Processing Systems

Yet, most existing approaches for hierarchical representation learning consider only one mode at a time. In this work, we propose an unsupervised method for jointly learning hierarchical representations of samples and features via Tree-Wasserstein Distance (TWD). Our method alternates between the two data modes. It first constructs a tree for one mode, then computes a TWD for the other mode based on that tree, and finally uses the resulting TWD to build the second mode's tree.


Structured Reinforcement Learning for Combinatorial Decision-Making

Neural Information Processing Systems

Reinforcement learning (RL) is increasingly applied to real-world problems involving complex and structured decisions, such as routing, scheduling, and assortment planning. These settings challenge standard RL algorithms, which struggle to scale, generalize, and exploit structure in the presence of combinatorial action spaces. We propose Structured Reinforcement Learning (SRL), a novel actor-critic paradigm that embeds combinatorial optimization-layers into the actor neural network. We enable end-to-end learning of the actor via Fenchel-Young losses and provide a geometric interpretation of SRL as a primal-dual algorithm in the dual of the moment polytope. Across six environments with exogenous and endogenous uncertainty, SRL matches or surpasses the performance of unstructured RL and imitation learning on static tasks and improves over these baselines by up to 92\% on dynamic problems, with improved stability and convergence speed.


Causal Explanation-Guided Learning for Organ Allocation

Neural Information Processing Systems

A central challenge in organ transplantation is the extremely low acceptance rate of donor organ offers--typically in the single digits--leading to high discard rates and suboptimal use of available grafts. Current acceptance models embedded in allocation systems are non-causal, trained on observational data, and fail to generalize to policy-relevant counterfactuals. This limits their reliability for both policy evaluation and simulator-based optimization. In this work, we reframe organ offer acceptance as a counterfactual prediction problem and propose a method to learn from routinely recorded--but often overlooked--refusal explanations. These refusal reasons act as direction-only counterfactual signals: for example, a refusal reason such as old donor age implies acceptance might have occurred had the donor been younger. We formalize this setting and introduce ClexNet, a novel causal model that learns policy-invariant representations via balanced training and an explanation-guided augmentation loss. On both synthetic and semi-synthetic data, ClexNet outperforms existing acceptance models in predictive performance, generalization, and calibration, offering a robust drop-in improvement for simulators and allocation policy evaluation. Beyond transplantation, our approach provides a general method for incorporating human direction-only explanations as a form of model supervision, improving performance in settings where only observational data is available.


Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Neural Information Processing Systems

Agentic search such as Deep Research systems-where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers-represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.


Scaling Up Active Testing to Large Language Models

Neural Information Processing Systems

Active testing enables label-efficient evaluation of predictive models through careful data acquisition, but it can pose a significant computational cost. We identify cost-saving measures that enable active testing to be scaled up to large language models (LLMs). In particular we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning, does not require updating within an active-testing loop, and can be smaller than the target model. We even find we can make good data-acquisition decisions without making predictions with the target model. As a result we are able to achieve much more accurate evaluations of LLM performance relative to using randomly acquired data. We additionally introduce a bootstrap estimator of evaluation error, which we show to be a useful indicator of how well active testing is working within a single run.


Geometry-Aware Collaborative Multi-Solutions Optimizer for Model Fine-Tuning with Parameter Efficiency

Neural Information Processing Systems

We propose a framework grounded in gradient flow theory and informed by geometric structure that provides multiple diverse solutions for a given task, ensuring collaborative results that enhance performance and adaptability across different tasks. This framework enables flexibility, allowing for efficient task-specific fine-tuning while preserving the knowledge of the pre-trained foundation models. Extensive experiments across transfer learning, few-shot learning, and domain generalization show that our proposed approach consistently outperforms existing Bayesian methods, delivering strong performance with affordable computational overhead and offering a practical solution by updating only a small subset of parameters.


CTRL-ALT-DECEIT Sabotage Evaluations for Automated AI R&D

Neural Information Processing Systems

AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures.


SD-KDE: Score-Debiased Kernel Density Estimation

Neural Information Processing Systems

We propose a method for density estimation that leverages an estimated score function to debias kernel density estimation (SD-KDE). In our approach, each data point is adjusted by taking a single step along the score function with a specific choice of step size, followed by standard KDE with a modified bandwidth. The step size and modified bandwidth are chosen to remove the leading order bias in the KDE, improving the asymptotic convergence rate. Our experiments on synthetic tasks in 1D, 2D and on MNIST, demonstrate that our proposed SD-KDE method significantly reduces the mean integrated squared error compared to the standard Silverman KDE, even with noisy estimates in the score function. These results underscore the potential of integrating score-based corrections into nonparametric density estimation.