

A Cramér-von Mises Approach to Incentivizing Truthful Data Sharing

Neural Information Processing Systems

Modern data marketplaces and data sharing consortia increasingly rely on incentive mechanisms to encourage agents to contribute data. However, schemes that reward agents based on the quantity of submitted data are vulnerable to manipulation, as agents may submit fabricated or low-quality data to inflate their rewards. Prior work has proposed comparing each agent's data against others' to promote honesty: when others contribute genuine data, the best way to minimize discrepancy is to do the same. Yet prior implementations of this idea rely on very strong assumptions about the data distribution (e.g.


9118ad115831e52cfeec1acd40c6e0f3-Paper-Position_Paper_Track.pdf

Neural Information Processing Systems

Science progresses by iteratively advancing and correcting humanity's understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, and potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.


Pump.Fun's Bounties Platform Is a Black Hole of Circular Grifting

WIRED

The crypto platform claims you can "pay anyone to do anything," from quitting a job on camera to getting a memecoin-themed tattoo. But it mostly seems like people trying to scam each other. Would you run into a crowded university lecture hall, fart into a megaphone, and bellow "fartcoin" at the top of your lungs? If so--and should you have the means to document this stunt on video, preferably capturing the audience's reaction--you may claim a reward of approximately $1,000. The money, of course, will be dispensed in fartcoin, a meme cryptocurrency trading at a little over 10 cents at time of publication, with a total market capitalization hovering around $130 million. Such is the promise of Pump.Fun GO, a new feature on Pump.Fun, one of the fastest-growing crypto businesses of the past few years.


BENCH Can Language Agents Solve Machine

Neural Information Processing Systems

We introduce MLRC-BENCH, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work, e.g., AI Scientist [40], which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-BENCH measures the key steps of proposing and implementing novel research methods and evaluates them with a rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB [22]) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and agents' actual performance on cutting-edge ML research problems. MLRC-BENCH is a dynamic benchmark, designed to grow continually with new ML competitions to encourage rigorous and objective evaluations of AI's research capabilities. Our leaderboard and code are publicly available at https://huggingface.co/spaces/launch/MLRC_Bench.


Stop DDoS Attacking the Research Community with AI-Generated Survey Papers

Neural Information Processing Systems

Survey papers are foundational to the scholarly progress of research communities, offering structured overviews that guide both novices and experts across disciplines. However, the recent surge of AI-generated surveys, especially enabled by large language models (LLMs), has transformed this traditionally labor-intensive genre into a low-effort, high-volume output. While such automation lowers entry barriers, it also introduces a critical threat: the phenomenon we term the "survey paper DDoS attack" to the research community. This refers to the unchecked proliferation of superficially comprehensive but often redundant, low-quality, or even hallucinated survey manuscripts, which floods preprint platforms, overwhelms researchers, and erodes trust in the scientific record. In this position paper, we argue that we must stop uploading massive amounts of AI-generated survey papers (i.e., survey paper DDoS attack) to the research community, by instituting strong norms for AI-assisted review writing. We call for restoring expert oversight and transparency in AI usage and, moreover, developing new infrastructures such as Dynamic Live Surveys, community-maintained, version-controlled repositories that blend automated updates with human curation. Through quantitative trend analysis, quality audits, and cultural impact discussion, we show that safeguarding the integrity of surveys is no longer optional but imperative to the research community.


Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Neural Information Processing Systems

Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark--a curated set of high-impact attacks--and evaluate it across 19 state-of-the-art models.


The Leaderboard Illusion

Neural Information Processing Systems

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also become more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have skewed the competitive landscape. Specifically, undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and selectively retract scores.


Results of the Big ANN: NeurIPS'23 competition

Neural Information Processing Systems

The 2023 Big ANN Challenge, held at NeurIPS'23, aimed at advancing the state-of-the-art in indexing data structures and search algorithms. It focused on practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search [21], this competition addressed filtered search, out-of-distribution data, and sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources.


1543d6d5cb976e4f9fbfaedf2e257967-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

LCDB 1.1: A Database Illustrating Learning Curves Are More Ill-Behaved Than Previously Thought. For the actual appendices, please see the main paper submission. Here, we would like to make a few notes regarding the dataset hosting. Self-Hosting Platform: Our dataset is self-hosted on the 4TU.ResearchData platform, a trusted institutional repository based in the Netherlands, which guarantees long-term preservation of research data for a minimum of 15 years. Data Access Note: We provide a public access link (also attached in the main submission). Machine Access via Croissant Metadata: For machine access, the Croissant metadata file can be found in our GitHub repository.


Appendices

Neural Information Processing Systems

A Limitations. As described in Sections 4 and 6, users would tailor attacks to image clusters. In the case of the beige-box track, we outright provided these clusters by disclosing which image indices corresponded to which general watermark type. For the black-box track, several winning teams clustered images into groups by artifact varieties and did so by hand. For the latter, this was made possible because (1) our dataset was relatively small, enabling this type of manual data labeling, and (2) they were made aware that the dataset contained mixtures of several watermarks. A database owner who uses only one type of watermark is unlikely to produce such variation in artifacts. Additionally, we use the watermark models and settings provided in the original papers and do not calibrate the strength of watermarks.