Goto

Collaborating Authors

 review score



How to Find Fantastic AI Papers: Self-Rankings as a Powerful Predictor of Scientific Impact Beyond Peer Review

Su, Buxin, Collina, Natalie, Wen, Garrett, Li, Didong, Cho, Kyunghyun, Fan, Jianqing, Zhao, Bingxin, Su, Weijie

arXiv.org Artificial Intelligence

Peer review in academic research aims not only to ensure factual correctness but also to identify work of high scientific potential that can shape future research directions. This task is especially critical in fast-moving fields such as artificial intelligence (AI), yet it has become increasingly difficult given the rapid growth of submissions. In this paper, we investigate an underexplored measure for identifying high-impact research: authors' own rankings of their multiple submissions to the same AI conference. Grounded in game-theoretic reasoning, we hypothesize that self-rankings are informative because authors possess unique understanding of their work's conceptual depth and long-term promise. To test this hypothesis, we conducted a large-scale experiment at a leading AI conference, where 1,342 researchers self-ranked their 2,592 submissions by perceived quality. Tracking outcomes over more than a year, we found that papers ranked highest by their authors received twice as many citations as their lowest-ranked counterparts; self-rankings were especially effective at identifying highly cited papers (those with over 150 citations). Moreover, we showed that self-rankings outperformed peer review scores in predicting future citation counts. Our results remained robust after accounting for confounders such as preprint posting time and self-citations. Together, these findings demonstrate that authors' self-rankings provide a reliable and valuable complement to peer review for identifying and elevating high-impact research in AI.


From Authors to Reviewers: Leveraging Rankings to Improve Peer Review

Wang, Weichen, Shi, Chengchun

arXiv.org Artificial Intelligence

This paper is a discussion of the 2025 JASA discussion paper by Su et al. (2025). We would like to congratulate the authors on conducting a comprehensive and insightful empirical investigation of the 2023 ICML ranking data. The review quality of machine learning (ML) conferences has become a big concern in recent years, due to the rapidly growing number of submitted manuscripts. In this discussion, we propose an approach alternative to Su et al. (2025) that leverages ranking information from reviewers rather than authors. We simulate review data that closely mimics the 2023 ICML conference submissions. Our results show that (i) incorporating ranking information from reviewers can significantly improve the evaluation of each paper's quality, often outperforming the use of ranking information from authors alone; and (ii) combining ranking information from both reviewers and authors yields the most accurate evaluation of submitted papers in most scenarios.


LLM-REVal: Can We Trust LLM Reviewers Yet?

Li, Rui, Gu, Jia-Chen, Kung, Po-Nien, Xia, Heming, liu, Junfeng, Kong, Xiangwen, Sui, Zhifang, Peng, Nanyun

arXiv.org Artificial Intelligence

The rapid advancement of large language models (LLMs) has inspired researchers to integrate them extensively into the academic workflow, potentially reshaping how research is practiced and reviewed. While previous studies highlight the potential of LLMs in supporting research and peer review, their dual roles in the academic workflow and the complex interplay between research and review bring new risks that remain largely underexplored. In this study, we focus on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness, examining the potential risks of using LLMs as reviewers by simulation. This simulation incorporates a research agent, which generates papers and revises, alongside a review agent, which assesses the submissions. Based on the simulation results, we conduct human annotations and identify pronounced misalignment between LLM-based reviews and human judgments: (1) LLM reviewers systematically inflate scores for LLM-authored papers, assigning them markedly higher scores than human-authored ones; (2) LLM reviewers persistently underrate human-authored papers with critical statements (e.g., risk, fairness), even after multiple revisions. Our analysis reveals that these stem from two primary biases in LLM reviewers: a linguistic feature bias favoring LLM-generated writing styles, and an aversion toward critical statements. These results highlight the risks and equity concerns posed to human authors and academic research if LLMs are deployed in the peer review cycle without adequate caution. On the other hand, revisions guided by LLM reviews yield quality gains in both LLM-based and human evaluations, illustrating the potential of the LLMs-as-reviewers for early-stage researchers and enhancing low-quality papers.


NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

Zhao, Penghai, Tian, Jinyu, Xing, Qinghua, Zhang, Xin, Li, Zheng, Qian, Jianjun, Cheng, Ming-Ming, Li, Xiang

arXiv.org Artificial Intelligence

The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems. Code and dataset are released at sway.cloud.microsoft/Pr42npP80MfPhvj8.


Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications

Keuper, Janis

arXiv.org Artificial Intelligence

The ongoing intense discussion on rising LLM usage in the scientificpeer-review process has recently been mingled by reports of authors using hi dden prompt injections to manipulate review scores. Since the existence of su ch "attacks" - although seen by some commentators as "self-defense" - would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation uses 1k reviews of 2024 ICLR papers generated by a wide range of LLMs shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores. II) LLM reviews are generally biased toward acceptance (>95% in many models). Both results have great impact on the ongoing discussionson LLM usage in peer-review.


Supplementary Materials A Causal Concept Effects and Metrics for Explanation Methods

Neural Information Processing Systems

Data do not materialize out of thin air. Rather, data are generated from real-world processes with complex causal structures we do not observe directly. G nor can we observe both interventions for the same subject. For example, in the context of CEBaB, we might ask 1. Each of the above questions requires the estimation of a different theoretical quantity.


Automatic Evaluation Metrics for Artificially Generated Scientific Research

Höpner, Niklas, Eshuijs, Leon, Alivanistos, Dimitrios, Zamprogno, Giacomo, Tiddi, Ilaria

arXiv.org Artificial Intelligence

Foundation models are increasingly used in scientific research, but evaluating AI-generated scientific work remains challenging. While expert reviews are costly, large language models (LLMs) as proxy reviewers have proven to be unreliable. To address this, we investigate two automatic evaluation metrics, specifically citation count prediction and review score prediction. We parse all papers of OpenReview and augment each submission with its citation count, reference, and research hypothesis. Our findings reveal that citation count prediction is more viable than review score prediction, and predicting scores is more difficult purely from the research hypothesis than from the full paper. Furthermore, we show that a simple prediction model based solely on title and abstract outperforms LLM-based reviewers, though it still falls short of human-level consistency.


Reviews: Joint Optimization of Tree-based Index and Deep Model for Recommender Systems

Neural Information Processing Systems

The review scores were somewhat borderline, but overall slightly above the acceptance threshold. There was some disagreement among the reviewers, following which a discussion was initiated. The rebuttal largely addresses the concerns of R1 (the most negative review), and in the metareviewer's opinion does a reasonable job of addressing these concerns, which are mostly clarifications regarding the performance of the algorithm. Positively, the reviewers mostly concur that the method, while fairly straightforward, offers significant improvements over existing techniques. After discussion there was some positive movement in review scores resulting in a positive consensus among reviewers.


The 12 best gadgets we reviewed this year

Engadget

I've lost count of the number of things we reviewed this year at Engadget. In 2024, the types of products we tested ranged from the typical phones, laptops and headphones to AI wearables, robotic lawnmowers and handheld gaming consoles, alongside games and shows. It can feel hard to keep track of it all, but thankfully, our scoring system helps us highlight the best (and the worst) devices each year. Our team of reviewers and editors evaluate products based on their performance, value and how they hold up against the competition, and at least two people weigh in on every score before it's published. If something gets a result of 80 and up, it's considered a "Recommended" product, while those scoring 90 and more are awarded "Editors' Choice." The latter means they're the best in their class, beating out most of the competition.