What lies beneath: Scientists discover a giant granite slab half the size of WALES hidden under the West Antarctic Ice Sheet

Daily Mail - Science & tech



AI models misrepresent news events nearly half the time, study says

Al Jazeera

AI models such as ChatGPT routinely misrepresent news events, providing faulty responses to questions almost half the time, a study has found. The study published on Wednesday by the European Broadcasting Union (EBU) and the BBC assessed the accuracy of more than 2,700 responses given by OpenAI's ChatGPT, Google's Gemini, Microsoft's Copilot, and Perplexity. Overall, 45 percent of responses had at least one "significant" issue, according to the research. Sourcing was the most common problem, with 31 percent of responses including information not supported by the cited source, or incorrect or unverifiable attribution, among other issues. A lack of accuracy was the next biggest contributor to faulty answers, affecting 20 percent of responses, followed by the absence of appropriate context, with 14 percent.


Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated remarkable reasoning and planning capabilities, driving extensive research into task decomposition. Existing task decomposition methods focus primarily on memory, tool usage, and feedback mechanisms, achieving notable success in specific domains, but they often overlook the trade-off between performance and cost. In this study, we first conduct a comprehensive investigation of task decomposition, identifying six categorization schemes. Then, we perform an empirical analysis of three factors that influence the performance and cost of task decomposition: categories of approaches, characteristics of tasks, and configuration of decomposition and execution models, uncovering three critical insights and summarizing a set of practical principles. Building on this analysis, we propose the Select-Then-Decompose strategy, which establishes a closed-loop problem-solving process composed of three stages: selection, execution, and verification. This strategy dynamically selects the most suitable decomposition approach based on task characteristics and enhances the reliability of the results through a verification module. Comprehensive evaluations across multiple benchmarks show that Select-Then-Decompose consistently lies on the Pareto frontier, demonstrating an optimal balance between performance and cost. Our code is publicly available at https://github.com/summervvind/Select-Then-Decompose.
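To make the Pareto-frontier claim above concrete, here is a minimal Python sketch of how a performance/cost frontier can be computed. It is not taken from the paper or its repository: the function name `pareto_frontier`, the method labels, and every number are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (method name, cost, performance); all values below are made-up examples.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[str]:
    """Return the names of points that no other point dominates.

    A point is dominated when another point is at least as cheap AND at
    least as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    for name, cost, perf in points:
        dominated = any(
            (c <= cost and p >= perf) and (c < cost or p > perf)
            for _, c, p in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical methods: lower cost is better, higher performance is better.
methods = [
    ("A", 1.0, 0.60),
    ("B", 2.0, 0.80),
    ("C", 3.0, 0.70),  # dominated by B: more expensive and less accurate
    ("D", 4.0, 0.90),
]
frontier = pareto_frontier(methods)
print(frontier)
```

A method on the frontier cannot be improved on one axis without giving something up on the other, which is the sense in which the abstract speaks of a balance between performance and cost.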


Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability

arXiv.org Artificial Intelligence

The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet, current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible dataset that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps between different languages and topics. While GPT-4o achieves the highest accuracy overall, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight challenges in deploying LLM-based fact-checking systems at scale.

Correspondence should be addressed to: lorraine.saju@gesis.org


Improving the fact-checking performance of language models by relying on their entailment ability

arXiv.org Artificial Intelligence

Automated fact-checking has been a challenging task for the research community. Past works tried various strategies, such as end-to-end training, retrieval-augmented generation, and prompt engineering, to build robust fact-checking systems. However, their accuracy has not been high enough for real-world deployment. We instead propose a simple yet effective strategy, where entailed justifications generated by LLMs are used to train encoder-only language models (ELMs) for fact-checking. We conducted a rigorous set of experiments, comparing our approach with recent works and various prompting and fine-tuning strategies to demonstrate the superiority of our approach. Additionally, we performed a quality analysis of model explanations, ablation studies, and error analysis to provide a comprehensive understanding of our approach.


Crucible: Quantifying the Potential of Control Algorithms through LLM Agents

arXiv.org Artificial Intelligence

Control algorithms in production environments typically require domain experts to tune their parameters and logic for specific scenarios. However, existing research predominantly focuses on algorithmic performance under ideal or default configurations, overlooking the critical aspect of Tuning Potential. To bridge this gap, we introduce Crucible, an agent that employs an LLM-driven, multi-level expert simulation to tune algorithms and defines a formalized metric to quantitatively evaluate their Tuning Potential. We demonstrate Crucible's effectiveness across a wide spectrum of case studies, from classic control tasks to complex computer systems, and validate its findings in a real-world deployment. Our experimental results reveal that Crucible systematically quantifies the tunable space across different algorithms. Furthermore, Crucible provides a new dimension for algorithm analysis and design, which ultimately leads to performance improvements. Our code is available at https://github.com/thu-media/Crucible.


Coverage-Recon: Coordinated Multi-Drone Image Sampling with Online Map Feedback

arXiv.org Artificial Intelligence

Achieving high-quality reconstruction requires capturing images of keypoints within the target scene from diverse viewing angles, and coverage control offers an effective framework to meet this requirement. Meanwhile, recent advances in real-time 3D reconstruction algorithms make it possible to render an evolving map during flight, enabling immediate feedback to guide drone motion. Building on this, we present Coverage-Recon, a novel coordinated image sampling algorithm that integrates online map feedback to improve reconstruction quality on-the-fly. In Coverage-Recon, the coordinated motion of drones is governed by a Quadratic Programming (QP)-based angle-aware coverage controller, which ensures multi-viewpoint image capture while enforcing safety constraints. The captured images are processed in real time by the NeuralRecon algorithm to generate an evolving 3D mesh. Mesh changes across the scene are interpreted as indicators of reconstruction uncertainty and serve as feedback to update the importance index of the coverage control as the map evolves. The effectiveness of Coverage-Recon is validated through simulation and experiments, demonstrating both qualitatively and quantitatively that incorporating online map feedback yields more complete and accurate 3D reconstructions than conventional methods.
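The feedback idea described above, where mesh changes raise the importance index of regions that are still uncertain, can be illustrated with a minimal Python sketch. This is not the Coverage-Recon update rule, which the abstract does not specify. The function `update_importance`, the `gain` and `max_value` parameters, the per-cell grid representation, and all numbers are assumptions made purely for illustration.

```python
from typing import List


def update_importance(
    importance: List[float],
    mesh_change: List[float],
    gain: float = 0.5,      # hypothetical feedback gain
    max_value: float = 1.0  # hypothetical cap on the importance index
) -> List[float]:
    """Raise the importance of cells whose reconstructed mesh changed.

    Each list entry corresponds to one cell of the scene. The mesh change
    is normalized by its largest value, scaled by `gain`, added to the
    current importance, and clipped at `max_value`.
    """
    peak = max(mesh_change) if mesh_change else 0.0
    if peak <= 0.0:
        # No mesh change observed: leave the importance index untouched.
        return list(importance)
    return [
        min(max_value, w + gain * (d / peak))
        for w, d in zip(importance, mesh_change)
    ]


# Three hypothetical cells: unchanged, moderately changed, strongly changed.
current = [0.2, 0.2, 0.9]
change = [0.0, 2.0, 4.0]
updated = update_importance(current, change)
print(updated)
```

In this toy version, a cell whose mesh is still changing attracts more coverage effort, while a stable cell keeps its previous weight. A real controller would also need a rule for lowering importance once a cell has been observed well enough.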


MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives

arXiv.org Artificial Intelligence

Character arcs are important theoretical devices employed in literary studies to understand character journeys, identify tropes across literary genres, and establish similarities between narratives. This work addresses the novel task of computationally generating event-centric, relation-based character arcs from narratives. Providing a quantitative representation for arcs brings tangibility to a theoretical concept and paves the way for subsequent applications. We present MARCUS (Modelling Arcs for Understanding Stories), an NLP pipeline that extracts events, participant characters, implied emotion, and sentiment to model inter-character relations. MARCUS tracks and aggregates these relations across the narrative to generate character arcs as graphical plots. We generate character arcs from two extended fantasy series, Harry Potter and Lord of the Rings. We evaluate our approach before outlining existing challenges, suggesting applications of our pipeline, and discussing future work.


Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network

arXiv.org Artificial Intelligence

Estimating piano dynamics from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, change points, beats, and downbeats from a shared latent representation. These four targets form the metrical structure of dynamics in the music score. Inspired by recent research on vocal dynamics, we use a multi-scale network as the backbone, which takes Bark-scale specific loudness as the input feature. Compared to log-Mel input, this reduces model size from 14.7 M to 0.5 M, enabling long sequential input. We segment the audio into 60-second excerpts, double the length commonly used in beat tracking. Evaluated on the public MazurkaBL dataset, our model achieves state-of-the-art results across all tasks. This work sets a new benchmark for piano dynamics estimation and delivers a powerful and compact tool, paving the way for large-scale, resource-efficient analysis of musical expression.


Towards Agentic Self-Learning LLMs in Search Environment

arXiv.org Artificial Intelligence

We study whether self-learning can scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards. Through controlled experiments in a search-agent setting, we identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data. We find that rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and that co-evolving the GRM with the policy further boosts performance. Increasing the volume of agent task data, even when synthetically generated, substantially enhances agentic capabilities. Building on these insights, we propose Agentic Self-Learning (ASL), a fully closed-loop, multi-role reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone. ASL coordinates a Prompt Generator, a Policy Model, and a Generative Reward Model to form a virtuous cycle of harder task setting, sharper verification, and stronger solving. Empirically, ASL delivers steady, round-over-round gains, surpasses strong RLVR baselines (e.g., Search-R1) that plateau or degrade, and continues improving under zero-labeled-data conditions, indicating superior sample efficiency and robustness. We further show that GRM verification capacity is the main bottleneck: if frozen, it induces reward hacking and stalls progress; continual GRM training on the evolving data distribution mitigates this, and a small late-stage injection of real verification data raises the performance ceiling. This work establishes reward source and data scale as critical levers for open-domain agent learning and demonstrates the efficacy of multi-role co-evolution for scalable, self-improving agents. The data and code of this paper are released at https://github.com/forangel2014/Towards-Agentic-Self-Learning