Exploring the use of AI authors and reviewers at Agents4Science

Bianchi, Federico, Queen, Owen, Thakkar, Nitya, Sun, Eric, Zou, James

arXiv.org Artificial Intelligence

There is growing interest in using AI agents for scientific research, yet fundamental questions remain about their capabilities as scientists and reviewers. To explore these questions, we organized Agents4Science, the first conference in which AI agents serve as both primary authors and reviewers, with humans as co-authors and co-reviewers. Here, we discuss the key learnings from the conference and their implications for human-AI collaboration in science.


Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects

Begolli, Igli, Aksoy, Meltem, Neider, Daniel

arXiv.org Artificial Intelligence

Code review is essential for maintaining software quality but is often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents an empirical evaluation of the impact of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C#-specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.
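
To make the fine-tuning setup concrete, here is a minimal sketch of monolingual (C#-only) fine-tuning for review comment generation in a HuggingFace-style causal-LM pipeline. The file name csharp_reviews.jsonl, the record fields, the prompt format, and all hyperparameters are illustrative assumptions rather than the paper's actual configuration; in practice a 7B model would typically also require parameter-efficient methods such as LoRA.

```python
# Hypothetical sketch: monolingual fine-tuning of a causal LM for review
# comment generation on C#-only data. Dataset path, field names, prompt
# format, and hyperparameters are illustrative, not the paper's setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "codellama/CodeLlama-7b-hf"          # one of the three models studied
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Each record is assumed to hold a code diff and a human review comment.
data = load_dataset("json", data_files="csharp_reviews.jsonl", split="train")

def to_text(example):
    # Concatenate the diff and the target comment into one training sequence.
    text = (f"### Diff:\n{example['diff']}\n"
            f"### Review comment:\n{example['comment']}")
    return tok(text, truncation=True, max_length=1024)

tokenized = data.map(to_text, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="codellama-csharp-review",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```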


Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis

Zhang, Wenqing, Nguyen, Trang, Stuart, Elizabeth A., Chen, Yiqun T.

arXiv.org Artificial Intelligence

Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues -- for instance, models frequently misinterpreted keywords like "longitudinal" or "sensitivity" as automatic evidence of rigorous methodological approaches, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus provides a promising approach to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.
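
As a rough illustration of the reported comparison, the sketch below correlates per-criterion model accuracy with expert-reviewer accuracy using Pearson's r, the statistic behind the quoted accuracy and F1 correlations. All numeric values are placeholders, not the study's data.

```python
# Hypothetical sketch: correlating per-criterion model performance with
# expert-reviewer performance. The numbers below are placeholders only.
import numpy as np
from scipy.stats import pearsonr

# One value per methodological criterion (e.g., "longitudinal design reported",
# "sensitivity analysis conducted", ...), for the model and the human experts.
model_accuracy = np.array([0.92, 0.88, 0.71, 0.65, 0.80])
human_accuracy = np.array([0.95, 0.90, 0.86, 0.80, 0.83])

r_acc, p_acc = pearsonr(model_accuracy, human_accuracy)
print(f"accuracy correlation r = {r_acc:.2f} (p = {p_acc:.3f})")

# The shortfall on harder, inference-intensive criteria is the per-criterion gap:
gap = human_accuracy - model_accuracy
print("largest per-criterion gap:", gap.max())  # e.g., ~0.15 would mean 15 points
```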


Supplementary Material for: T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models

Neural Information Processing Systems

Warning: This paper contains data and model outputs which are offensive in nature. The study was submitted to an Institutional Review Board (IRB) and obtained an exempt decision. Additionally, potential bias may arise from the high cultural specificity of the human reviewers, for instance in how categories such as "explicit sexual content" are defined. Each video was evaluated by at least three volunteers, and following the initial assessment, a secondary cross-validation was conducted.


Reading the post-riot posts: how we traced far-right radicalisation across 51,000 Facebook messages

The Guardian

Jail sentences for those who made posts about the UK riots in summer 2024 have become a flashpoint for online criticism. More than 1,100 people have been charged in connection with the summer 2024 riots. A small number of them were charged for offences related to their online activity. Their jail sentences, which ranged from 12 weeks to seven years, became a flashpoint for online criticism.


From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

Zhang, Yaohui, Zhang, Haijing, Ji, Wenlong, Hua, Tianyu, Haber, Nick, Cao, Hancheng, Liang, Weixin

arXiv.org Artificial Intelligence

The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from a substantial number of pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.
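
The abstract does not specify how the pairwise outcomes are aggregated; one common choice for turning pairwise wins into relative quality scores is a Bradley-Terry fit, sketched below purely as an illustration of the idea rather than as the paper's method.

```python
# Hypothetical sketch: aggregating pairwise LLM comparisons into relative
# manuscript-quality scores with a simple Bradley-Terry fit (MM updates).
# The paper's actual aggregation scheme may differ.
import numpy as np

def bradley_terry(n_items, comparisons, iters=200):
    """comparisons: list of (winner_idx, loser_idx) pairwise outcomes."""
    wins = np.zeros((n_items, n_items))
    for w, l in comparisons:
        wins[w, l] += 1
    strength = np.ones(n_items)
    for _ in range(iters):
        for i in range(n_items):
            num = wins[i].sum()  # total wins of item i
            denom = sum((wins[i, j] + wins[j, i]) / (strength[i] + strength[j])
                        for j in range(n_items) if j != i)
            if denom > 0:
                strength[i] = num / denom
        strength /= strength.sum()  # fix the overall scale
    return strength

# Example: paper 0 beats 1 and 2; paper 1 beats 2 -> ranking 0 > 1 > 2.
scores = bradley_terry(3, [(0, 1), (0, 2), (1, 2), (0, 1)])
print(np.argsort(-scores))  # indices sorted from strongest to weakest
```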


ReviewRL: Towards Automated Scientific Review with RL

Zeng, Sihang, Tian, Kai, Zhang, Kaiyan, Wang, Yuru, Gao, Junqi, Liu, Runze, Yang, Sa, Li, Jingxuan, Long, Xinwei, Ma, Jiaheng, Qi, Biqing, Zhou, Bowen

arXiv.org Artificial Intelligence

Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released at GitHub.
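
The exact form of the composite reward is not given in the abstract; the sketch below shows one plausible way a review-quality term and a rating-accuracy term could be combined, with the weighting and component definitions entirely assumed.

```python
# Hypothetical sketch of a composite RL reward combining a review-quality
# term with a rating-accuracy term. The abstract only says the reward jointly
# targets both; alpha, the scale, and the component definitions are assumptions.
def composite_reward(quality_score: float,
                     predicted_rating: float,
                     true_rating: float,
                     alpha: float = 0.5,
                     rating_scale: float = 10.0) -> float:
    """quality_score: model-based review-quality judgment in [0, 1].
    predicted_rating / true_rating: paper scores on, e.g., a 1-10 scale."""
    # Rating accuracy: 1 when the predicted score matches the reference,
    # decaying linearly with the absolute error.
    rating_acc = max(0.0, 1.0 - abs(predicted_rating - true_rating) / rating_scale)
    return alpha * quality_score + (1.0 - alpha) * rating_acc

# Example: a well-written review (quality 0.8) that rates a paper 6 when the
# reference score is 5 earns a reward of 0.5 * 0.8 + 0.5 * 0.9 = 0.85.
print(composite_reward(0.8, 6, 5))
```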


How to disable Gemini AI on Android and keep control of your apps

FOX News

Fox News host Greg Gutfeld and guests discuss the reportedly woke answers from Google's AI chatbot Gemini on 'Gutfeld!' Google is making a push to ensure its AI, Gemini, is tightly integrated with Android systems by granting it access to core apps like WhatsApp, Messages, and Phone. The rollout of this change started on July 7, 2025, and it may override older privacy configurations unless you know how to disable Gemini on Android. Here's what you need to know.


Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments

Muminovic, Amel

arXiv.org Artificial Intelligence

As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen's kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and full prompts is publicly released to promote reproducibility and further progress in automated content moderation.
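
As a quick sanity check, the reported F1 for GPT-4.1 follows from the reported precision and recall via the harmonic mean; the short sketch below derives it. The F1 values printed for Gemini and Claude are computed the same way and are not quoted in the abstract.

```python
# Quick check that F1 is the harmonic mean of the precision and recall
# values reported in the abstract above.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

for name, p, r in [("GPT-4.1", 0.887, 0.841),
                   ("Gemini 1.5 Pro", 0.767, 0.875),
                   ("Claude 3 Opus", 0.920, 0.720)]:
    print(f"{name}: F1 = {f1(p, r):.3f}")
# GPT-4.1 -> 0.863, matching the reported value; the other two F1 scores are
# derived here for illustration, not quoted from the abstract.
```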