Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor

Neural Information Processing Systems

In AI research and practice, rigor remains largely understood in terms of methodological rigor--such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about the capabilities of AI systems. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception--in addition to a more expansive understanding of (1) methodological rigor--should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also provide useful language and a framework for much-needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders.


GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Neural Information Processing Systems

Worldwide image geolocalization--the task of predicting GPS coordinates from images taken anywhere on Earth--poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods. We also release our code, checkpoint, and dataset online for ease of reproduction.
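As a rough illustration of what ranking candidates by geographic proximity involves, the sketch below orders candidate coordinates by great-circle distance to a reference location. The abstract does not say which distance measure GeoRanker uses, so the haversine formula, the Earth radius constant, and the function names `haversine_km` and `rank_candidates` are all assumptions made for this example, not the authors' implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def rank_candidates(reference, candidates):
    """Return candidate indices sorted from nearest to farthest from `reference`.

    `reference` is a (lat, lon) pair and `candidates` a list of (lat, lon) pairs.
    """
    lat, lon = reference
    distances = [haversine_km(lat, lon, c_lat, c_lon) for c_lat, c_lon in candidates]
    return sorted(range(len(candidates)), key=lambda i: distances[i])


if __name__ == "__main__":
    paris = (48.8566, 2.3522)
    candidates = [(40.7128, -74.0060), (51.5074, -0.1278), (48.85, 2.35)]
    print(rank_candidates(paris, candidates))
```

An ordering like this could serve as a supervision target for a ranking model, whereas point-wise supervision would score each candidate in isolation.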


Norwegian crown princess's son found guilty of two counts of rape

BBC News

Marius Borg Høiby, the 29-year-old son of Norway's Crown Princess Mette-Marit, has been found guilty of two counts of rape and given four years in prison. The three judges in courtroom 250 at Oslo District Court cleared him of two other counts of rape, but found him guilty of many of the other offences of which he had been accused. Høiby was not in court for the verdict, but joined the session via video link. Prosecutors had called for Høiby to be given seven years and seven months in prison. His defence lawyers had called for a lesser term of 18 months and can appeal against the verdict.


Enhancing CLIP Robustness via Cross-Modality Alignment

Neural Information Processing Systems

Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization, and they often overlook the gap in CLIP's encoded features: the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose CrOss-modaLity Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space.


Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Neural Information Processing Systems

Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.


Versatile Transferable Unlearnable Example Generator

Neural Information Processing Systems

The rapid growth of publicly available data has fueled deep learning advancements but also raises concerns about unauthorized data usage. Unlearnable Examples (UEs) have emerged as a data protection strategy that introduces imperceptible perturbations to prevent unauthorized learning. However, most existing UE methods produce perturbations strongly tied to specific training sets, leading to a significant drop in unlearnability when applied to unseen data or tasks. In this paper, we argue that for broad applicability, UEs should maintain their effectiveness across diverse application scenarios. To this end, we conduct the first comprehensive study on the transferability of UEs across diverse and practical yet demanding settings. Specifically, we identify key scenarios that pose significant challenges for existing UE methods, including varying styles, out-of-distribution classes, resolutions, and architectures.


Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

Neural Information Processing Systems

Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in large language models (LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs the long-form reasoning attributes, e.g., output length and self-reflection. By simply amplifying these activations and adding "wait" tokens, the long CoT ability can be invoked without training, leading to a significantly increased self-reflection rate and accuracy. We also find that the activation changes follow predictable trajectories, i.e., a sharp rise after special tokens and a subsequent exponential decay. Based on these insights, we introduce a general training-free activation control technique. It uses a few contrastive examples to identify the relevant activations, and then applies simple analytic functions to adjust their values at inference time to elicit long CoTs. Extensive experiments verify the effectiveness of our methods in efficiently eliciting the long CoT ability of LLMs and improving performance. We further propose a parameter-efficient fine-tuning method that trains only the last-layer activation amplification module and a few LoRA layers, outperforming LoRA on reasoning benchmarks with far fewer parameters.
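The "sharp rise after special tokens and a subsequent exponential decay" can be pictured with a simple analytic schedule. The sketch below is only an illustration of that shape: the abstract does not give the functional form or its parameters, so `amplification_factor`, `amplification_schedule`, the `peak` and `decay_rate` values, and the use of "wait" as the only trigger token are assumptions, not the paper's method.

```python
import math


def amplification_factor(steps_since_trigger, peak=2.0, decay_rate=0.1):
    """Scale factor that equals `peak` at a trigger token and decays exponentially toward 1.0.

    `steps_since_trigger` is None when no trigger token has been seen yet,
    in which case activations are left unscaled (factor 1.0).
    All parameter values are hypothetical.
    """
    if steps_since_trigger is None:
        return 1.0
    return 1.0 + (peak - 1.0) * math.exp(-decay_rate * steps_since_trigger)


def amplification_schedule(tokens, trigger_tokens=frozenset({"wait"}), peak=2.0, decay_rate=0.1):
    """Return one scale factor per token position for a generated token sequence."""
    factors = []
    steps = None  # number of tokens since the most recent trigger token
    for tok in tokens:
        if tok in trigger_tokens:
            steps = 0  # sharp rise: reset to the peak at each trigger
        elif steps is not None:
            steps += 1  # decay continues with every subsequent token
        factors.append(amplification_factor(steps, peak, decay_rate))
    return factors


if __name__ == "__main__":
    for tok, f in zip(["so", "wait", "let", "me", "check"],
                      amplification_schedule(["so", "wait", "let", "me", "check"])):
        print(f"{tok:>6}: {f:.3f}")
```

In an actual model, a factor like this would multiply the selected activations at inference time; identifying which activations to scale is a separate step that this sketch does not cover.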


19206a6ed5ed0aaeed440448dfc5cf7e-Paper-Conference.pdf

Neural Information Processing Systems

LLM-agent systems often decompose high-level objectives into subtask dependency graphs, assuming that each subtask's output is reliable and conditionally independent of others given its parent responses. However, this assumption frequently breaks during execution, as ground-truth responses are inaccessible, leading to inter-agent misalignment--failures caused by inconsistencies and coordination breakdowns among agents [1]. To address this, we propose SEQCV, a dynamic framework for reliable execution under violated conditional independence. SEQCV executes subtasks sequentially, each conditioned on all prior verified responses, and performs consistency checks immediately after agents generate short token sequences. At each checkpoint, a token sequence is accepted only if it represents shared knowledge consistently supported across diverse LLMs; otherwise, it is discarded, triggering recursive subtask decomposition for finer-grained reasoning. Despite its sequential nature, SEQCV avoids repeated corrections on the same misalignment and achieves higher effective throughput than parallel pipelines. Across multiple reasoning and coordination tasks, SEQCV improves accuracy by up to 30% over existing LLM-agent systems.
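The control flow described above (sequential execution, a consistency check at each step, and recursive decomposition on rejection) can be sketched in a few lines. This is a toy illustration under stated assumptions, not SEQCV itself: models are plain Python callables, "consistently supported" is approximated by exact string agreement, and the names `consensus`, `run_sequential`, `decompose`, `min_agreement`, and `max_depth` are hypothetical.

```python
from collections import Counter


def consensus(responses, min_agreement=1.0):
    """Return the most common response if at least `min_agreement` of the models gave it, else None."""
    if not responses:
        return None
    answer, count = Counter(responses).most_common(1)[0]
    return answer if count / len(responses) >= min_agreement else None


def run_sequential(subtasks, models, decompose, min_agreement=1.0, max_depth=2):
    """Execute subtasks in order, conditioning each on all previously verified responses.

    `models` is a list of callables taking (task, verified_context) and returning a string.
    `decompose` splits a rejected task into finer-grained subtasks.
    Returns a list of (task, answer) pairs; answer is None for unresolved tasks.
    """
    verified = []

    def solve(task, depth):
        context = tuple(pair for pair in verified if pair[1] is not None)
        responses = [model(task, context) for model in models]
        answer = consensus(responses, min_agreement)
        if answer is not None:
            verified.append((task, answer))  # accepted: models agree
            return
        parts = decompose(task) if depth < max_depth else []
        if not parts:
            verified.append((task, None))  # give up: mark as unresolved
            return
        for part in parts:  # rejected: retry on finer-grained subtasks
            solve(part, depth + 1)

    for task in subtasks:
        solve(task, 0)
    return verified


if __name__ == "__main__":
    # Toy stand-ins for LLMs: they agree on short inputs and disagree on long ones.
    model_a = lambda task, ctx: task.upper()
    model_b = lambda task, ctx: task.upper() if len(task) < 4 else task
    halve = lambda task: [task[: len(task) // 2], task[len(task) // 2:]]
    print(run_sequential(["ab", "abcdef"], [model_a, model_b], halve))
```

A real system would compare short token sequences semantically rather than by exact match, and the agreement threshold would be a design choice rather than the fixed default used here.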


Macron's G7 legacy hangs on fickle AI funding and data centers

The Japan Times

With less than a year left in office, Emmanuel Macron wants to be remembered as the French president who put Europe back in the technology race. His decade-old ambition to turn France into a "startup nation" never fully delivered. Now Macron sees a second chance by positioning France as Europe's artificial intelligence powerhouse, leveraging the nation's abundant supply of nuclear energy for data centers. He convinced SoftBank Group to invest as much as €75 billion ($87 billion) in French projects. His advisers have dubbed the AI effort "Project Marengo," a reference to Napoleon Bonaparte's victory over an Austrian army in 1800 at the battle of the same name, won through speed and decisive action. Marengo was also a political victory, securing Bonaparte's hold on power.


CGBENCH: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

Neural Information Processing Systems

Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBENCH, a robust benchmark that tests reasoning capabilities of LMs on scientific publications.