

Our Greatest Living Biographer Is Back With His First Single-Subject Book in Decades. It's Enthralling.

Slate

Richard Holmes, our greatest living biographer, is back with an enthralling chronicle of the poet.


Engaging look at friction shows how it keeps our world rubbing along

New Scientist

How much do you know about friction? Jennifer R. Vail's charming, if sometimes technical, biography of the force showcases its amazing and largely overlooked role in everything from climate change to dark matter, says Karmela Padavic-Callaghan. In 2009, World Aquatics banned a specific type of swimsuit from all international competitions in water sports, ruling that it gave athletes an unfair advantage. The development of this swimsuit included the use of NASA's testing facilities and sophisticated computer software. Some versions had ultrasonically welded seams instead of traditional stitches. Swimmers who wore the suit broke 23 of the 25 world records set at the Beijing Olympics in 2008.



Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Gu, Xinran, Lyu, Kaifeng, Li, Jiazheng, Zhang, Jingzhao

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.
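The knapsack intuition behind these phase transitions can be made concrete with a toy model (this is an illustrative sketch, not the paper's formal framework; all constants and loss curves below are invented). Web data is given smooth diminishing returns on allocated capacity, while knowledge-dense data carries a fixed overhead that makes small allocations worthless. Brute-forcing the allocation that minimizes the mixture loss then shows the optimal knowledge-dense capacity jumping discontinuously as the mixing ratio crosses a threshold:

```python
# Toy capacity-allocation model (invented for illustration).
CAPACITY = 100   # total model capacity, in arbitrary "bits"
OVERHEAD = 30    # fixed cost before any biography can be stored
N_BIOS = 70      # number of biographies; 1 bit each after the overhead

def web_loss(c):
    """Smooth diminishing returns on web data."""
    return 1.0 / (1.0 + 0.05 * c)

def bio_loss(c):
    """All-or-nothing: allocations below OVERHEAD are wasted."""
    if c < OVERHEAD:
        return 1.0
    return 1.0 - min(c - OVERHEAD, N_BIOS) / N_BIOS

def optimal_bio_capacity(r):
    """Brute-force the allocation minimizing the mixture test loss."""
    return min(range(CAPACITY + 1),
               key=lambda c: (1 - r) * web_loss(CAPACITY - c) + r * bio_loss(c))

for r in [0.05, 0.2, 0.35, 0.5, 0.65, 0.8]:
    print(f"mixing ratio {r:.2f} -> optimal bio capacity {optimal_bio_capacity(r)}")
```

Because any allocation between 1 and OVERHEAD is strictly dominated by allocating nothing, the optimum never sits in that range: as the mixing ratio grows, the argmin jumps from zero straight past the overhead, mimicking the sudden onset of memorization the abstract describes.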



Hubble: a Model Suite to Advance the Study of LLM Memorization

Wei, Johnny Tian-Zheng, Godbole, Ameya, Khan, Mohammad Aflah, Wang, Ryan, Zhu, Xiaoyuan, Flemings, James, Kashyap, Nitya, Gummadi, Krishna P., Neiswanger, Willie, Jia, Robin

arXiv.org Artificial Intelligence

We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to the size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.


A Controllable Examination for Long-Context Language Models

Yang, Yijun, Huang, Zeyu, Zhu, Wenhao, Qiu, Zihan, Yuan, Fei, Pan, Jeff Z., Titov, Ivan

arXiv.org Artificial Intelligence

Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world applications (e.g., document summarization) and synthetic tasks (e.g., needle-in-a-haystack). Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks often involve complexity that makes interpretation challenging and suffer from data contamination, whereas synthetic tasks frequently lack meaningful coherence between the target information (needle) and its surrounding context (haystack), undermining their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: 1) seamless context, 2) controllable setting, and 3) sound evaluation. This study introduces $\textbf{LongBioBench}$, a benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of understanding, reasoning, and trustworthiness. Our experimental evaluation, which includes 18 LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, render them weak tests of models' long-context capabilities. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.
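The core idea of using generated biographies as a controllable haystack can be sketched in a few lines (this is a minimal illustration, not LongBioBench's actual generation code; the templates, name pools, and the string-search "model" stand-in are all invented). Templated biographies form a coherent context, one fact is designated the needle, and the ground-truth answer is known by construction:

```python
import random

FIRST = ["Alice", "Bruno", "Chen", "Dara", "Emil", "Farah"]
LAST = ["Okafor", "Lindqvist", "Marino", "Sato", "Novak", "Haddad"]
CITIES = ["Lyon", "Osaka", "Porto", "Tallinn", "Quito", "Accra"]
JOBS = ["architect", "botanist", "cartographer", "violinist"]

def make_bio(rng):
    """One templated biography; the needle and distractors share its shape."""
    name = f"{rng.choice(FIRST)} {rng.choice(LAST)}"
    return name, (f"{name} was born in {rng.choice(CITIES)} and worked as "
                  f"a {rng.choice(JOBS)} for {rng.randint(5, 40)} years.")

def build_context(n_bios, seed=0):
    """Build a haystack of unique biographies plus one needle question."""
    rng = random.Random(seed)
    bios = {}
    while len(bios) < n_bios:
        name, bio = make_bio(rng)
        bios.setdefault(name, bio)
    target = rng.choice(list(bios))
    context = " ".join(bios.values())
    question = f"Where was {target} born?"
    answer = bios[target].split(" born in ")[1].split(" and")[0]
    return context, question, answer

context, question, answer = build_context(30)
print(question, "->", answer)
```

Because every distractor biography has the same surface form as the needle, retrieval cannot succeed by format cues alone, and the context length, number of distractors, and fact types are all directly configurable.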


Letters from Our Readers

The New Yorker

Readers respond to Anthony Lane's essay about Christopher Marlowe, Lauren Collins's report on Uniqlo, and Dhruv Khullar's article about A.I. and medical diagnosis. I very much enjoyed Anthony Lane's gleeful review of Stephen Greenblatt's new biography of Christopher Marlowe (Books, September 15th). Lane reminds us that Marlowe took the plot of his play "Dido, Queen of Carthage" from Virgil's Aeneid. I'm not convinced, though, that Virgil would "blench" at Marlowe's opening scene, where a lecherous Jupiter entertains Ganymede, a boy, on his knee. Have another look at the opening verses of the Aeneid (especially Book I, line 28).



ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning

Cekinmez, Jasin, Ghahroodi, Omid, Chandle, Saad Fowad, Gupta, Dhiman, Asgari, Ehsaneddin

arXiv.org Artificial Intelligence

We introduce ADAM (A Diverse Archive of Mankind), a framework for evaluating and improving multimodal large language models (MLLMs) in biographical reasoning. To the best of our knowledge, this is the first work to systematically examine LLM capabilities in biography, a critical yet underexplored dimension of factual knowledge. At its core, AdamDB is a multilingual and multimodal dataset covering over 4 million individuals across geography, time, and profession, while AdamBench provides cognitively structured evaluations based on Bloom's taxonomy, spanning six reasoning levels in both English and native languages. To address hallucinations, particularly for lesser-known individuals, we propose AdamRAG, a retrieval-augmented generation system tailored to biographical contexts. Experiments show that AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning. Popularity strongly mediates accuracy, and multimodal input via face images offers smaller, less consistent improvements than retrieval. ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing the development of multilingual, accurate, and hallucination-resistant MLLMs.
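The retrieval-augmented setup can be illustrated with a minimal sketch (this is not the paper's AdamRAG system; the two-entry biography store and the word-overlap retriever are stand-ins invented for the example). A biography is retrieved by lexical overlap with the question and prepended to the prompt before the model answers:

```python
# Hypothetical biography store; entries are hand-written for the sketch.
BIOS = {
    "Ada Lovelace": "Ada Lovelace (1815-1852) wrote the first published "
                    "algorithm intended for Babbage's Analytical Engine.",
    "Ibn Battuta": "Ibn Battuta (1304-1369) was a Moroccan traveler whose "
                   "journeys spanned much of Africa and Asia.",
}

def retrieve(question, k=1):
    """Rank biographies by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(BIOS.items(),
                    key=lambda kv: len(q_words & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(question):
    """Prepend the retrieved biography so the model answers from context."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What did Ada Lovelace write?"))
```

Grounding the answer in a retrieved biography rather than parametric memory is what lets such a system help most on lesser-known individuals, where hallucination risk is highest.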