
CalArena: A Large-Scale Post-Hoc Calibration Benchmark

arXiv.org Machine Learning

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.
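
The abstract does not give a formula for PHI, so the following is only a minimal sketch under an assumed reading: improvement is taken as the drop in a proper scoring rule (here log loss) on held-out data after a calibrator is fitted. Temperature scaling stands in for "a post-hoc method", and every function name below is hypothetical rather than part of the benchmark's code.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax of logits divided by a temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(logit_rows, labels, temperature=1.0):
    # Mean negative log-likelihood, a proper scoring rule.
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    # Grid search over temperatures 0.25 .. 10.0 (an arbitrary choice).
    grid = grid or [0.25 * k for k in range(1, 41)]
    return min(grid, key=lambda t: log_loss(logit_rows, labels, t))

def post_hoc_improvement(cal_logits, cal_labels, test_logits, test_labels):
    # ASSUMPTION: improvement = score before calibration - score after.
    # Positive means the calibrator helped; negative means it hurt.
    t = fit_temperature(cal_logits, cal_labels)
    before = log_loss(test_logits, test_labels, 1.0)
    after = log_loss(test_logits, test_labels, t)
    return before - after, t

# Overconfident toy model: very large logits, but only 3 of 4 correct.
logits = [[8.0, 0.0], [8.0, 0.0], [8.0, 0.0], [8.0, 0.0]]
labels = [0, 0, 0, 1]
phi, temperature = post_hoc_improvement(logits, labels, logits, labels)
```

Because the score is computed on predictions rather than on binned confidences, a calibrator that damages the model's predictions shows up as a negative value, which matches the abstract's point about capturing degradation. The toy example reuses one dataset for fitting and scoring purely for brevity; a real comparison would keep them separate.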


On Language Generation in the Limit with Bounded Memory

arXiv.org Machine Learning

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an ``approximate'' version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.
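
For reference, Sperner's theorem, which the abstract names as an ingredient of the density bound, is the standard statement below. How the paper applies it is not described in the abstract, so only the classical result is given. If $\mathcal{F}$ is a family of subsets of $\{1, \dots, n\}$ in which no member contains another (an antichain), then

\[
|\mathcal{F}| \;\leq\; \binom{n}{\lfloor n/2 \rfloor},
\]

with equality attained by the family of all subsets of size $\lfloor n/2 \rfloor$. One classical proof partitions the subsets of $\{1, \dots, n\}$ into $\binom{n}{\lfloor n/2 \rfloor}$ symmetric chains and notes that an antichain meets each chain at most once.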


Reasoning with Sampling: Cutting at Decision Points

arXiv.org Machine Learning

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
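
The abstract does not spell out how cut positions are chosen, so the following is only a rough sketch of the general idea. It assumes the base model's next-token distribution is available at every position of the trace and picks a cut with probability proportional to the entropy there. The proportional weighting and all function names are hypothetical, and the paper's actual rule (it mentions entropy jumps) and its Metropolis-Hastings acceptance step are not reproduced.

```python
import math
import random

def entropy(probs):
    # Shannon entropy (in nats) of one next-token distribution.
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)

def cut_weights(next_token_dists):
    # ASSUMPTION: proposal probability proportional to entropy.
    # Falls back to uniform if every position is deterministic.
    ents = [entropy(d) for d in next_token_dists]
    total = sum(ents)
    if total == 0.0:
        return [1.0 / len(ents)] * len(ents)
    return [e / total for e in ents]

def sample_cut(next_token_dists, rng=random):
    # Draw the position from which the suffix would be resampled.
    weights = cut_weights(next_token_dists)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy trace: position 1 is a coin flip (a "decision point"),
# the other positions are nearly or fully deterministic.
dists = [[1.0, 0.0], [0.5, 0.5], [0.99, 0.01], [1.0, 0.0]]
weights = cut_weights(dists)
cut = sample_cut(dists, random.Random(0))
```

With a non-uniform proposal like this, a correct Metropolis-Hastings sampler would also have to include the forward and reverse proposal probabilities in its acceptance ratio; that step is omitted here.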


AI facial recognition to check age of asylum seekers from next year

BBC News

An AI facial recognition tool that aims to detect adult migrants posing as children will be deployed at the UK's borders next year. A software company has been awarded a contract to develop and test the technology, which will estimate a person's age by analysing photographs of them taken at the border. The Home Office says the technology will make it easier to identify adult migrants attempting to game the system, after initial testing indicated promising performance and accuracy. But Human Rights Watch urged the government to scrap the scheme, describing it as unproven technology that will undermine the protections vulnerable children are entitled to. Unaccompanied child migrants are processed through the care system rather than the asylum system, which can make it easier to stay in the country.


The Internet Is Somehow Obsessed With the Pope's First Major Letter. I Read It--and Totally See Why.

Slate

I Read the Pope's Encyclical on A.I. I'm Astounded By What He Wrote. It's an urgent warning--and a celebration of humanity and what we can do at our best.


Weekly quiz: Which tennis star dazzled the French Open with an 'Eiffel Tower' dress?

BBC News

This week, more details about the Married At First Sight UK scandal came to light, former SNP chief executive Peter Murrell admitted embezzling more than £400,000 from the party, and almost 90 drones crashed into Sydney's Darling Harbour when a light show went wrong. But how much attention did you pay to what else happened in the world over the past seven days? Try last week's quiz, or have a go at something from the archives.


Latvia parliament approves new gov't after drone dispute toppled coalition

Al Jazeera

Latvia's parliament has approved a new coalition government that will lead the European Union and NATO member country in the coming months after its predecessor collapsed following an argument over its handling of stray drones suspected to be from Ukraine. By a margin of 66 deputies in the 100-seat assembly, lawmakers on Thursday confirmed 47-year-old centrist Andris Kulbergs as prime minister. He will lead the Baltic nation of more than 1.8 million people until parliamentary elections on October 3. His predecessor, Silina, quit after Defence Minister Andris Spruds, a member of the Progressives Party, was forced to resign over the government's handling of multiple incidents in which stray drones suspected to be from Ukraine crossed into Latvian territory. Silina accused the minister of not deploying anti-drone defences fast enough to parry two wayward Ukrainian attack drones, which are thought to have been knocked off course by Russian jamming. At the time, she said Spruds had lost her trust and that of the public.


Image of Thai police in sparkly dresses with handcuffed suspect turns out to be AI fake

The Guardian

The real image, which the police station has since shared, shows the officers in normal clothes and no female officer in the picture at all. Picture was created by administrator in charge of station's Facebook account who wanted to create 'friendlier image'. It was an arresting image and an irresistible story. A group of tough Thai police officers - five men and one woman - all wearing elaborate festival-style dresses, surrounding a drug dealer they had caught while undercover. The image, released by local police, was so compelling that it found its way on to the front page of the UK's Daily Star, as well as in picture stories in the Telegraph, the Sun and the New York Post. The Sun wrote: "The burly crew of five men and one woman slipped into skin tight sequins and feathers for the covert mission in Thailand."


Mathematical AI helps researchers crack 50-year-old problem

New Scientist

Just a week after an AI disproved an 80-year-old conjecture and astonished mathematicians, another conjecture that had stood for half a century has fallen, with a proof inspired by the same techniques but this time written entirely by humans. Last week, an unreleased AI model from OpenAI disproved an important conjecture first posed by Hungarian mathematician Paul Erdős, called the unit distance problem. The puzzle, which Erdős considered his "most striking contribution to geometry" and which many mathematicians had failed to unravel, concerns the number of similar-sized connections you can make between dots arranged on a flat surface. Erdős had set a ceiling on this number, which many experts had assumed was correct. But the AI model showed that this number could in fact be much larger, using an obscure trick from algebraic number theory to make complex structures with extremely high dimensions, which could then be used to place the dots in a very different arrangement from any that humans had considered.


Irish datacentres have increased household bills by hundreds of euros, report finds

The Guardian

Datacentre industry representatives disputed the findings and said the sector boosted the economy. 'Hidden datacentre tax' costing Irish households millions, report says. Datacentres used 22% of country's electricity last year, pushing up household bills, study suggests. Thu 28 May 2026 09.01 EDT. Last modified on Thu 28 May 2026 09.32 EDT. Energy demand by datacentres in Ireland has added hundreds of euros to household electricity bills in a pattern that could be replicated across Europe, according to a report. Ireland's growing number of datacentres last year used 22% of the country's electricity, more than all urban homes combined, according to the Central Statistics Office. The equivalent figure in the US and UK is 6%.