verdict
- Africa > Nigeria (0.14)
- North America > Canada (0.14)
- Africa > Kenya (0.14)
- (22 more...)
- Research Report (0.67)
- Workflow (0.46)
- Media > News (1.00)
- Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Health & Medicine > Epidemiology (0.68)
- Africa > Nigeria (0.14)
- Africa > Kenya (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- (38 more...)
- Media > News (1.00)
- Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- (4 more...)
Our verdict on Annie Bot: This novel about a sex robot split opinions
Members of the New Scientist Book Club give their take on Sierra Greer's award-winning science-fiction novel Annie Bot, our read for January - and the needle swings wildly from positive to negative.

Annie Bot by Sierra Greer was the Book Club's January read

The New Scientist Book Club moved on from reading a classic piece of science fiction in December - Iain M. Banks's - to an award-winning sci-fi novel in January: Sierra Greer's Annie Bot, which won the Arthur C. Clarke Award in 2025. I must admit, I was nervous to announce this one to my fellow readers. Annie Bot is the story of a sex robot, owned by a controlling and abusive man. It gets very dark in places and it has a number of sex scenes, and I wanted to make sure you all knew what you were getting into before starting. That cupboard scene, some way into the book, was super disturbing, for example. It turns out my wariness was warranted.
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Robots (0.75)
Exploring Health Misinformation Detection with Multi-Agent Debate
Chen, Chih-Han, Tsai, Chen-Han, Peng, Yu-Shao
Fact-checking health-related claims has become increasingly critical as misinformation proliferates online. Effective verification requires both the retrieval of high-quality evidence and rigorous reasoning processes. In this paper, we propose a two-stage framework for health misinformation detection: Agreement Score Prediction followed by Multi-Agent Debate. In the first stage, we employ large language models (LLMs) to independently evaluate retrieved articles and compute an aggregated agreement score that reflects the overall evidence stance. When this score indicates insufficient consensus (falling below a predefined threshold), the system proceeds to a second stage. Multiple agents engage in structured debate to synthesize conflicting evidence and generate well-reasoned verdicts with explicit justifications. Experimental results demonstrate that our two-stage approach achieves superior performance compared to baseline methods, highlighting the value of combining automated scoring with collaborative reasoning for complex verification tasks.
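The first stage described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the paper's implementation: each LLM judge assigns a stance to a retrieved article, stances are averaged into an agreement score, and weak consensus triggers escalation to the debate stage. The names and the threshold value are assumptions.

```python
# Hypothetical sketch of Agreement Score Prediction: each LLM judge
# returns a stance in {-1, 0, +1} (refutes / neutral / supports) for a
# retrieved article; the mean stance is the agreement score, and a
# score near zero means no consensus, so the claim is escalated.

def aggregate_agreement(stances, threshold=0.5):
    """Average per-article stances; escalate to debate when consensus is weak."""
    score = sum(stances) / len(stances)
    needs_debate = abs(score) < threshold
    return score, needs_debate

score, debate = aggregate_agreement([1, 1, -1, 1, 0])
# mean stance is 0.4, below the (assumed) threshold, so debate is triggered
```

The threshold trades precision for cost: a higher value routes more claims through the expensive debate stage.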
- Media > News (0.93)
- Health & Medicine (0.68)
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Liu, Yuhan, Qin, Lianhui, Wang, Shengjie
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
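The consensus expert selection mechanism mentioned in the abstract can be illustrated as follows. This is a minimal sketch under the assumption that draft experts emit (answer, reasoning-path) pairs and that "high-agreement" means agreeing with the plurality answer; names are not taken from the released code.

```python
from collections import Counter

# Illustrative consensus expert selection: only reasoning paths whose
# draft answers match the plurality answer are forwarded to the verdict
# model, so the strong VLM synthesizes fewer but more reliable paths.

def select_consensus_paths(drafts):
    """drafts: list of (answer, reasoning_path) pairs from small draft VLMs."""
    counts = Counter(ans for ans, _ in drafts)
    top_answer, _ = counts.most_common(1)[0]
    return [path for ans, path in drafts if ans == top_answer]

paths = select_consensus_paths([("42", "p1"), ("42", "p2"), ("7", "p3")])
# only the two paths agreeing on "42" reach the verdict stage
```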
VERIRAG: A Post-Retrieval Auditing of Scientific Study Summaries
Mohole, Shubham, Choi, Hongjun, Liu, Shusen, Klymko, Christine, Kushwaha, Shashank, Shi, Derek, Sakla, Wesam, Galhotra, Sainyam, Glatt, Ruben
Can democratized information gatekeepers and community note writers effectively decide what scientific information to amplify? Lacking domain expertise, such gatekeepers rely on automated reasoning agents that use RAG to ground evidence to cited sources. But such standard RAG systems validate summaries via semantic grounding and suffer from "methodological blindness," treating all cited evidence as equally valid regardless of rigor. To address this, we introduce VERIRAG, a post-retrieval auditing framework that shifts the task from classification to methodological vulnerability detection. Using private Small Language Models (SLMs), VERIRAG audits source papers against the Veritable taxonomy of statistical rigor. We contribute: (1) a benchmark of 1,730 summaries with realistic, non-obvious perturbations modeled after retracted papers; (2) the auditable Veritable taxonomy; and (3) an operational system that improves Macro F1 by at least 19 points over baselines using GPT-based SLMs, a result that replicates across MISTRAL and Gemma architectures. Given the complexity of detecting non-obvious flaws, we view VERIRAG as a "vulnerability-detection copilot," providing structured audit trails for human editors. In our experiments, individual human testers found over 80% of the generated audit trails useful for decision-making. We plan to release the dataset and code to support responsible science advocacy.
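The auditing idea above can be sketched as a checklist pass over a cited source, collecting failed checks as the audit trail. The check names below are illustrative placeholders; the paper's Veritable taxonomy defines the actual criteria of statistical rigor.

```python
# Minimal sketch of methodological vulnerability detection: run a
# summary's cited source through a set of rigor checks and return the
# failed ones as a structured audit trail for a human editor.

CHECKS = {
    "reports_sample_size": lambda src: src.get("n", 0) > 0,
    "has_control_group": lambda src: src.get("control", False),
    "corrects_multiple_comparisons": lambda src: src.get("mc_correction", False),
}

def audit(source):
    """Return the list of failed checks (the detected 'vulnerabilities')."""
    return [name for name, check in CHECKS.items() if not check(source)]

flags = audit({"n": 120, "control": True, "mc_correction": False})
# flags one vulnerability: no multiple-comparisons correction
```

In the paper these judgments are made by private SLMs reading the source paper, not by boolean fields; the structure of the audit trail is the point of the sketch.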
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.94)
- Information Technology (0.70)
- Government (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
SkillFactory: Self-Distillation For Learning Cognitive Behaviors
Sprague, Zayne, Lu, Jack, Wadhwa, Manya, Keh, Sedrick, Ren, Mengye, Durrett, Greg
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
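The rearrangement idea above can be made concrete with a toy example: two independent samples from the same base model are spliced into one "silver" SFT trace exhibiting a retry/backtracking skill. The template wording below is an assumption for illustration, not the paper's actual trace format.

```python
# Hedged sketch of SkillFactory-style self-distillation: no stronger
# teacher model is involved; a failed sample and a second sample from
# the same model are rearranged into a trace that demonstrates the
# "retry by an alternate method" cognitive skill.

def make_retry_trace(problem, failed_attempt, second_attempt):
    """Splice two self-samples into a single skill-formatted SFT trace."""
    return (
        f"Problem: {problem}\n"
        f"{failed_attempt}\n"
        "Wait, that approach seems off. Let me try another method.\n"
        f"{second_attempt}"
    )

trace = make_retry_trace("2+2?", "Guess: 5.", "Count up: 2, 3, 4. Answer: 4.")
```

The trace may be imperfect ("silver"), but per the abstract it primes the model to acquire the skill during subsequent RL.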
SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
Rahman, Salman, Gorantla, Sruthi, Gupta, Arpit, Roy, Swastik, Peng, Nanyun, Liu, Yang
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. In the first stage of our framework, we collect multiple independent step-level verifications for each solution; in the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpasses ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, from achieving gold-medal performance at the International Mathematical Olympiad to autonomous agentic coding (Castelvecchi, 2025; Luong & Lockhart, 2025; Yang et al., 2024b; Hurst et al., 2024; Anthropic, 2025). Recent breakthroughs like OpenAI's o1 and DeepSeek's R1 demonstrate that reinforcement learning (RL) post-training can significantly enhance reasoning capabilities beyond supervised fine-tuning alone (Jaech et al., 2024; Guo et al., 2025), as RL enables models to explore diverse solution paths and learn from feedback rather than imitation (Chu et al., 2025). While RL post-training shows promise, current approaches rely on verifiers that require ground truth references. Traditional methods rely on either discriminative verifiers that provide binary correctness signals (Cobbe et al., 2021) or rule-based verifiers using exact answer matching (RLVR) (Guo et al., 2025; Hu et al., 2025), both offering only sparse, outcome-level rewards. Recent advances introduce Process Reward Models (PRMs) that provide denser, step-level feedback to improve training stability and credit assignment (Lightman et al., 2023; Wang et al., 2024; Uesato et al., 2022), including co-evolving approaches like TANGO (Zha et al., 2025) and PRIME (Yuan et al., 2024) that jointly train the verifier alongside the policy model; PRIME requires outcome-level correctness labels to train its PRM (Zha et al., 2025; Yuan et al., 2024).
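The step-level aggregation that SPARK uses to build PRM training data can be sketched as a majority vote over independent verifier judgments. The agreement threshold and all names below are assumptions for illustration, not values from the paper.

```python
# Illustrative step-level aggregation: several independent verifier
# judgments per solution step are voted into a silver correctness label,
# which then serves as training data for the generative PRM.

def step_labels(verdicts_per_step, min_agree=0.5):
    """verdicts_per_step: one list of boolean verifier verdicts per step."""
    labels = []
    for verdicts in verdicts_per_step:
        frac_correct = sum(verdicts) / len(verdicts)
        labels.append(frac_correct >= min_agree)
    return labels

labels = step_labels([[True, True, False], [False, False, True]])
# step 1 is labeled correct (2/3 agree), step 2 incorrect (1/3)
```

Aggregating several noisy verifications per step, rather than trusting one, is what lets the synthetic labels rival reference-guided supervision in the reported results.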
Randomized Controlled Trials for Phishing Triage Agent
Security operations centers (SOCs) face a persistent challenge: efficiently triaging a high volume of user-reported phishing emails while maintaining robust protection against threats. This paper presents the first randomized controlled trial (RCT) evaluating the impact of a domain-specific AI agent - the Microsoft Security Copilot Phishing Triage Agent - on analyst productivity and accuracy. Our results demonstrate that agent-augmented analysts achieved up to 6.5 times as many true positives per analyst minute and a 77% improvement in verdict accuracy compared to a control group. The agent's queue prioritization and verdict explanations were both significant drivers of efficiency. Behavioral analysis revealed that agent-augmented analysts reallocated their attention, spending 53% more time on malicious emails, and were not prone to rubber-stamping the agent's malicious verdicts. These findings offer actionable insights for SOC leaders considering AI adoption, including the potential for agents to fundamentally change the optimal allocation of SOC resources.
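The study's headline efficiency metric is a rate ratio between the treatment and control arms. The sketch below uses made-up numbers purely to show the arithmetic; they are not the study's data.

```python
# True positives per analyst minute, compared across RCT arms.
# All inputs here are invented for illustration.

def tp_per_minute(true_positives, minutes):
    """Analyst productivity as confirmed-malicious verdicts per minute."""
    return true_positives / minutes

ratio = tp_per_minute(130, 60) / tp_per_minute(20, 60)
# rate ratio of about 6.5 with these made-up numbers, matching the
# shape of the reported "up to 6.5 times as many true positives"
```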
- Research Report > Strength High (1.00)
- Research Report > Experimental Study (1.00)
FinVet: A Collaborative Framework of RAG and External Fact-Checking Agents for Financial Misinformation Detection
Araya, Daniel Berhane, Liao, Duoduo
Financial markets face growing threats from misinformation that can trigger billions in losses in minutes. Most existing approaches lack transparency in their decision-making and provide limited attribution to credible sources. We introduce FinVet, a novel multi-agent framework that integrates two Retrieval-Augmented Generation (RAG) pipelines with external fact-checking through a confidence-weighted voting mechanism. FinVet employs adaptive three-tier processing that dynamically adjusts verification strategies based on retrieval confidence, from direct metadata extraction to hybrid reasoning to full model-based analysis. Unlike existing methods, FinVet provides evidence-backed verdicts, source attribution, confidence scores, and explicit uncertainty flags when evidence is insufficient. Experimental evaluation on the FinFact dataset shows that FinVet achieves an F1 score of 0.85, which is a 10.4% improvement over the best individual pipeline (fact-check pipeline) and 37% improvement over standalone RAG approaches.
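The confidence-weighted voting mechanism described above can be sketched as summing verifier confidences per verdict and taking the heaviest label. Verdict labels, weights, and names below are illustrative assumptions, not FinVet's actual interface.

```python
# Minimal sketch of confidence-weighted voting across the three
# verifiers (two RAG pipelines plus the external fact-check agent):
# each contributes (verdict, confidence), and the verdict with the
# largest total confidence wins.

def weighted_vote(results):
    """results: list of (verdict, confidence) pairs from the verifiers."""
    totals = {}
    for verdict, conf in results:
        totals[verdict] = totals.get(verdict, 0.0) + conf
    return max(totals, key=totals.get)

verdict = weighted_vote([("false", 0.9), ("true", 0.4), ("false", 0.6)])
# "false" wins with total weight 1.5 against 0.4
```

In the full system a low maximum total would instead raise the explicit uncertainty flag the abstract mentions; that branch is omitted here.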
- Asia > China (0.28)
- North America (0.28)
- Media > News (1.00)
- Law (1.00)
- Government (1.00)
- Banking & Finance > Trading (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)