 Large Language Model


Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text

arXiv.org Machine Learning

Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, making it an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini). The past few years have witnessed the emergence and rapid development of large language models (LLMs) such as GPT (Hurst et al., 2024), DeepSeek (Liu et al., 2024), Claude (Anthropic, 2024), Gemini (Comanici et al., 2025), Grok (xAI, 2025) and Qwen (Yang et al., 2025). Their impact is everywhere, from education, academia and software development to healthcare and everyday life (Arora & Arora, 2023; Chan & Hu, 2023; Hou et al., 2024). On one side of the coin, LLMs can support users with conversational question answering, help students learn more effectively, draft emails, write computer code, prepare presentation slides and more.
On the other side, their ability to closely mimic human-written text also raises serious concerns, including the generation of biased or harmful content, the spread of misinformation in the news ecosystem, and the challenges related to authorship attribution and intellectual property (Dave et al., 2023; Fang et al., 2024; Messeri & Crockett, 2024; Mahajan et al., 2025; Laurito et al., 2025). Addressing these concerns requires effective algorithms to distinguish between human-written and LLM-generated text, which has become an active and popular research direction in recent literature (see Crothers et al., 2023; Wu et al., 2025, for reviews).
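As a rough illustration of the fixed-distance baseline that the abstract contrasts with a learned distance, the sketch below scores a text by how much it changes when rewritten. Everything here is an assumption for illustration: the `rewrite` callable stands in for an LLM rewriter, the character-level `difflib` distance and the threshold are arbitrary choices, and the decision rule (a small change suggests LLM-generated text) follows earlier rewrite-based detectors rather than anything stated in this abstract. The paper's learned distance is not reproduced.

```python
from difflib import SequenceMatcher
from typing import Callable


def rewrite_distance(original: str, rewritten: str) -> float:
    """Fixed distance in [0, 1]; 0 means the rewrite is identical to the original."""
    return 1.0 - SequenceMatcher(None, original, rewritten).ratio()


def flag_as_llm(text: str, rewrite: Callable[[str], str], threshold: float = 0.2) -> bool:
    """Flag text whose rewrite stays close to the original.

    `rewrite` is a hypothetical stand-in for an LLM rewriting call, and
    `threshold` is an arbitrary illustrative value, not a tuned one.
    """
    return rewrite_distance(text, rewrite(text)) < threshold
```

A learned variant would replace `rewrite_distance` with a function fitted on labeled human and LLM text, which is the direction the abstract describes.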


A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth

arXiv.org Machine Learning

Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
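One common way to add a judge-specific discrimination parameter to a Bradley-Terry-Luce model is to scale the quality gap inside the logistic link. The sketch below assumes that form; the abstract does not give the exact parameterization, so the names `theta` (latent model quality) and `gamma` (judge discrimination) and the formula itself are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def win_probability(theta_i: float, theta_j: float, gamma_k: float) -> float:
    """Assumed model: P(judge k prefers i over j) = sigmoid(gamma_k * (theta_i - theta_j))."""
    return 1.0 / (1.0 + math.exp(-gamma_k * (theta_i - theta_j)))


def negative_log_likelihood(
    theta: Sequence[float],
    gamma: Sequence[float],
    comparisons: Sequence[Tuple[int, int, int]],
) -> float:
    """Sum of -log P over (winner, loser, judge) index triples."""
    return -sum(
        math.log(win_probability(theta[w], theta[l], gamma[k]))
        for w, l, k in comparisons
    )
```

Under this assumed form, a judge with `gamma_k = 0` is pure noise (every comparison is a coin flip), and setting every `gamma_k = 1` recovers the standard Bradley-Terry-Luce model. Minimizing `negative_log_likelihood` jointly over `theta` and `gamma` would need a normalization such as fixing the mean of `theta`, in line with the identifiability statement in the abstract.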


More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

arXiv.org Machine Learning

The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically-observed power-law behavior in pass@k leads to a sublinear growth of the coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a query method of LLMs that increases coverage@cost for any given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws.
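The pass@k definition above can be made concrete with a short sketch. The first function is the standard combinatorial estimator popularized alongside HumanEval (from `n` sampled attempts with `c` correct); the second is the closed form for independent attempts with a fixed per-attempt success probability `p`. Whether the paper uses either form is an assumption, and ReD itself is not sketched because the abstract does not describe its procedure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts of which c were correct (needs k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, so the estimate is 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_independent(p: float, k: int) -> float:
    """pass@k for one question whose attempts succeed independently with probability p."""
    return 1.0 - (1.0 - p) ** k
```

Averaging `pass_at_k` over the questions in a benchmark gives the benchmark-level value.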


Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors

arXiv.org Machine Learning

We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks.


'Uncanny Valley': Minneapolis Misinformation, TikTok's New Owners, and Moltbot Hype

WIRED

We'll link all the stories we spoke about today in the show notes. Adriana Tapia produced this episode, Amar Lal at Macro Sound mixed this episode, Matt Giles and Daniel Roman fact-checked this episode, Mark Leyda was our San Francisco studio engineer, Kate Osborn is our executive producer, and Katie Drummond is WIRED's global editorial director.


Microsoft stock plunges as Wall Street questions AI investments

Al Jazeera

Microsoft stock has slumped 12 percent as part of a software industry sell-off, stoking fears of whether hefty investments in artificial intelligence will pay off across the sector. The Redmond, Washington-based tech giant is on track Thursday to finish at its worst day since March 2020 and has seen approximately $400bn in valuation wiped out. Capital expenditures grew by 66 percent in the second quarter compared with the same period the year before, reaching a record $37.5bn for the quarter. Meanwhile, Microsoft predicted Azure growth to stay stable in the period from January to March at 37 percent to 38 percent, after slowing in the last three months of 2025, partially due to AI chip capacity constraints. "[Wall Street] wanted to see less cap-ex spending and faster cloud/AI monetisation and coming out of the gates, it's the opposite. We have said this is a multi-year journey, and Redmond needs to focus on its data center buildout with more customers heading down the AI path. It's a balancing act with 2026 the inflection year for AI and MSFT [Microsoft]," Dan Ives, analyst at Wedbush Securities, said in a note provided to Al Jazeera.


The AI Hype Index: Grok makes porn, and Claude Code nails your job

MIT Technology Review

Everyone is panicking because AI is very bad; everyone is panicking because AI is very good. It's just that you never know which one you're going to get. Grok is a pornography machine. Claude Code can do anything from building websites to reading your MRI. So of course Gen Z is spooked by what this means for jobs. Unnerving new research says AI is going to have a seismic impact on the labor market this year.


Big tech results show investor demand for payoffs from heavy AI spending

The Guardian

Meta wowed Wall Street with improvements in ad targeting fueled by AI alongside huge investment. Big tech earnings so far this week have sent a clear warning: investors are willing to overlook soaring spending on artificial intelligence if it fuels strong growth, but are quick to punish companies that fall short. The contrast was clear in Thursday's stock market reaction to earnings from Microsoft and Meta, highlighting how dramatically the stakes have changed since the launch of ChatGPT started the AI boom more than three years ago. Shares of the Instagram parent surged more than 9% on strong sales, while those of Microsoft slumped 10% after its cloud business failed to impress. "The market appears to be questioning whether these massive capital expenditure hikes will generate sufficient returns," said Jesse Cohen, senior analyst at Investing.com.


A Yann LeCun–Linked Startup Charts a New Path to AGI

WIRED

As the world's largest companies pour hundreds of billions of dollars into large language models, San Francisco-based Logical Intelligence is trying something different in pursuit of AI that can mimic the human brain. If you ask Yann LeCun, Silicon Valley has a groupthink problem. Since leaving Meta in November, the researcher and AI luminary has taken aim at the orthodox view that large language models (LLMs) will get us to artificial general intelligence (AGI), the threshold where computers match or exceed human smarts. Everyone, he declared in a recent interview, has been "LLM-pilled." On January 21, San Francisco-based startup Logical Intelligence appointed LeCun to its board.


DHS is using Google and Adobe AI to make videos

MIT Technology Review

Immigration agencies have been flooding social media with bizarre, seemingly AI-generated content. We now know more about what might be making it. The US Department of Homeland Security is using AI video generators from Google and Adobe to make and edit content shared with the public, a new document reveals. It comes as immigration agencies have flooded social media with content to support President Trump's mass deportation agenda--some of which appears to be made with AI--and as workers in tech have put pressure on their employers to denounce the agencies' activities. The document, released on Wednesday, provides an inventory of which commercial AI tools DHS uses for tasks ranging from generating drafts of documents to managing cybersecurity. In a section about "editing images, videos or other public affairs materials using AI," it reveals for the first time that DHS is using Google's Veo 3 video generator and Adobe Firefly, estimating that the agency has between 100 and 1,000 licenses for the tools.