Almanac


The 'Farmers' Almanac' says goodbye after 208 years

Popular Science

After more than 200 years of weather wisdom, folklore, and time-tested advice, the editors have announced that the 2026 edition of the 'Farmers' Almanac' will be its last. The website will remain operational through the end of December 2025. "Many of you grew up hearing your parents or grandparents quote from the Farmers' Almanac, always having a copy nearby. Maybe you have planted by our Moon phases, consulted the Almanac for the 'Best Days' to potty train, wean, or go fishing," Editor Sandi Duncan and Editor Emeritus Peter Geiger wrote in the announcement.


August stargazing: The Perseids, a 'big fish,' celestial conjunctions, and more

Popular Science

As any diligent stargazer knows, mid-summer means one thing: the Perseids! This meteor shower hits its peak on August 12 this year, and while that date is inconveniently close to that of this month's full moon, there should still be plenty of meteors on show for those who choose their time and location with care. As another long summer day finally recedes into night, look east. If the sky is clear, you might well spy the Summer Triangle.


ALMANACS: A Simulatability Benchmark for Language Model Explainability

Mills, Edmund, Su, Shiye, Russell, Stuart, Emmons, Scott

arXiv.org Machine Learning

How do we measure the efficacy of language model explainability methods? While many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help fill this gap, we present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. The ALMANACS scenarios span twelve safety-relevant topics such as ethical reasoning and advanced AI behaviors; they have idiosyncratic premises to invoke model-specific behavior; and they have a train-test distributional shift to encourage faithful explanations. By using another language model to predict behavior based on the explanations, ALMANACS is a fully automated benchmark. We use ALMANACS to evaluate counterfactuals, rationalizations, attention, and Integrated Gradients explanations. Our results are sobering: when averaged across all topics, no explanation method outperforms the explanation-free control. We conclude that despite modest successes in prior work, developing an explanation method that aids simulatability in ALMANACS remains an open challenge.

Understanding the behavior of deep neural networks is critical for their safe deployment. While deep neural networks are a black box by default, a wide variety of interpretability methods are being developed to explain their behavior (Räuker et al., 2023; Nauta et al., 2022). Some approaches, such as LIME (Ribeiro et al., 2016) and MUSE (Lakkaraju et al., 2019), try to approximate output behavior. Other approaches try to mechanistically explain the circuits inside a network (Nanda et al., 2023; Wang et al., 2023). Some approaches imitate explanations in the training data (Camburu et al., 2018; Narang et al., 2020; Marasović et al., 2022). Other approaches study the network's activations, such as a transformer's attention over its input (Serrano & Smith, 2019; Wiegreffe & Pinter, 2019). Others aim to create neural networks that are intrinsically explainable (Jain et al., 2020).

With so many interpretability methods to choose from, how can we tell which one works best? Despite years of work in the field, there is no consistent evaluation standard. New interpretability papers generally test their methods on bespoke tasks, making it difficult to assess their true effectiveness. To solve this issue, Doshi-Velez & Kim (2017), Nauta et al. (2022), and Räuker et al. (2023) argue that we need standard interpretability benchmarks. Just as benchmarks have driven progress in computer vision (Deng et al., 2009), natural language processing (Wang et al., 2019b;a), and reinforcement learning (Brockman et al., 2016; Tunyasuvunakool et al., 2020), we seek to drive progress in interpretability by enabling apples-to-apples comparisons across diverse methods.
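To make the simulatability setup concrete, here is a minimal Python sketch of an automated evaluation loop in the spirit of the abstract above: a second "predictor" language model guesses the explained model's behavior on held-out inputs, with and without access to an explanation, and an explanation method counts as helpful only if it beats the explanation-free control. All names and the toy stand-ins below are illustrative assumptions, not the benchmark's actual API.

```python
"""Hypothetical sketch of an automated simulatability score, assuming the model's
behavior on a prompt can be summarized as a single probability."""

from statistics import mean
from typing import Callable, Optional, Sequence

Behavior = Callable[[str], float]                  # explained model: prompt -> yes-probability
Predictor = Callable[[str, Optional[str]], float]  # second model: (prompt, explanation) -> guess


def simulatability_error(model: Behavior,
                         predictor: Predictor,
                         test_prompts: Sequence[str],
                         explanation: Optional[str]) -> float:
    """Mean absolute error of the predictor on held-out prompts (lower is better)."""
    return mean(abs(model(p) - predictor(p, explanation)) for p in test_prompts)


def evaluate_explanation_method(model: Behavior,
                                predictor: Predictor,
                                test_prompts: Sequence[str],
                                explanation: str) -> dict:
    """An explanation helps only if it beats the explanation-free control."""
    with_expl = simulatability_error(model, predictor, test_prompts, explanation)
    control = simulatability_error(model, predictor, test_prompts, None)
    return {"with_explanation": with_expl,
            "control": control,
            "improvement": control - with_expl}  # positive means the explanation helped


if __name__ == "__main__":
    # Toy stand-ins: the "model" says yes more often on longer prompts, and the
    # "predictor" defaults to 0.5 unless an explanation points it to prompt length.
    toy_model: Behavior = lambda p: min(1.0, len(p) / 40)

    def toy_predictor(p: str, expl: Optional[str]) -> float:
        return min(1.0, len(p) / 40) if expl else 0.5

    prompts = ["Should I fish today?",
               "Is it wise to plant before the last frost this season?"]
    print(evaluate_explanation_method(toy_model, toy_predictor, prompts,
                                      "The model says yes more often on longer questions."))
```

In a real benchmark the predictor would itself be a language model prompted with the explanations of training-set behavior, which is what makes the evaluation fully automated.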


Almanac: Retrieval-Augmented Language Models for Clinical Medicine

Zakka, Cyril, Chaurasia, Akash, Shad, Rohan, Dalal, Alex R., Kim, Jennifer L., Moor, Michael, Alexander, Kevin, Ashley, Euan, Boyd, Jack, Boyd, Kathleen, Hirsch, Karen, Langlotz, Curt, Nelson, Joanna, Hiesinger, William

arXiv.org Artificial Intelligence

In recent years, language model pre-training has emerged as a powerful training paradigm in natural language processing (NLP) [1-4]. For a large number of these language models, performance improvements have been empirically observed to scale with model and dataset size, with the well-documented emergence of zero-shot capabilities and sample efficiency on a range of downstream NLP tasks [5-7]. However, due to the nature of their training objective (predicting the next token in a sentence), large language models (LLMs) can be prone to generating factually incorrect statements, a phenomenon commonly known as hallucination [8, 9]. More contentiously, many works have also demonstrated these models' ability to reproduce social biases, as well as to generate statements reinforcing gender, racial, and religious stereotypes [10, 11]. In an effort to reduce these unwanted behaviors, several works have explored different ways of steering LLM outputs to more closely align with user intent, including fine-tuning with human feedback [12, 13] and natural language prompt engineering [14, 15].
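As a rough illustration of the retrieval-augmented approach named in the title (and not of the Almanac system itself), here is a minimal Python sketch: a toy word-overlap retriever selects reference passages and prepends them to the prompt so that a model answers from supplied context rather than from parametric memory alone. Every function, string, and document in this sketch is a hypothetical placeholder.

```python
"""Toy sketch of retrieval-augmented prompting: retrieve reference passages,
then build a prompt that instructs the model to answer only from that context."""


def word_overlap(query: str, passage: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages that best match the query under the toy score."""
    return sorted(corpus, key=lambda p: word_overlap(query, p), reverse=True)[:k]


def build_grounded_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved passages so the model answers from cited context."""
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus))
    return ("Answer using only the context below; say 'unknown' if it is insufficient.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")


if __name__ == "__main__":
    corpus = [
        "Guideline X recommends dose A for condition Y in adults.",
        "Trial Z reported no benefit of drug B for condition Y.",
        "Unrelated note about clinic scheduling.",
    ]
    print(build_grounded_prompt("What dose is recommended for condition Y?", corpus))
```

A production system would replace the word-overlap scorer with a proper retriever over a curated document store and pass the grounded prompt to an actual language model; the point of the sketch is only the shape of the pipeline.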


La Niña Effects? National Weather Service Predicts 2017 Winter Climate

International Business Times

The National Weather Service has released its first winter weather predictions for the approaching season in the United States. But if the wildcard La Niña develops, it might shake some things up. The chances that it will develop are strong, too: both observations and computer models suggest that La Niña is likely to form. If it does develop, Mike Halpert, deputy director of the Climate Prediction Center at the National Oceanic and Atmospheric Administration, predicts that it will be "weak and potentially short-lived." La Niña refers to colder-than-normal conditions in the Pacific Ocean near the equator.