Accomazzi, Alberto
AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy
Pan, Rui, Nguyen, Tuan Dung, Arora, Hardik, Accomazzi, Alberto, Ghosal, Tirthankar, Ting, Yuan-Sen
Continual pretraining of large language models on domain-specific data has been proposed to enhance performance on downstream tasks. In astronomy, the previous absence of astronomy-focused benchmarks has hindered objective evaluation of these specialized LLMs. Leveraging a recent initiative to curate high-quality astronomical MCQs, this study aims to quantitatively assess specialized LLMs in astronomy. We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms compared to the base model. We demonstrate that this performance degradation can be partially mitigated by utilizing high-quality data for continual pretraining, such as summarized text from arXiv. Despite the observed catastrophic forgetting in smaller models, our results indicate that continual pretraining on the 70B model can yield significant improvements. However, the current supervised fine-tuning dataset still constrains the performance of instruct models. In conjunction with this study, we introduce a new set of models, AstroLLaMA-3-8B and AstroLLaMA-2-70B, building upon the previous AstroLLaMA series.
AstroMLab 1: Who Wins Astronomy Jeopardy!?
Ting, Yuan-Sen, Nguyen, Tuan Dung, Ghosal, Tirthankar, Pan, Rui, Arora, Hardik, Sun, Zechang, de Haan, Tijmen, Ramachandra, Nesar, Wells, Azton, Madireddy, Sandeep, Accomazzi, Alberto
We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observed a universal reduction in cost every 3 to 12 months to achieve a similar score on this particular astronomy benchmark. Open-source models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify performance variations across topics, with non-English-focused models generally struggling more in exoplanet-related fields, stellar astrophysics, and instrumentation-related questions. These challenges likely stem from less abundant training data, limited historical context, and rapid recent developments in these areas. This pattern is observed across both open-weights and proprietary models, with regional dependencies evident, highlighting the impact of training data diversity on model performance in specialized scientific domains. Top-performing models demonstrate well-calibrated confidence, with correlations above 0.9 between confidence and correctness, though they tend to be slightly underconfident. The development of fast, low-cost inference for open-weights models presents new opportunities for affordable deployment in astronomy. The rapid progress observed suggests that LLM-driven research in astronomy may become feasible in the near future.
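The abstract does not say how the correlation between confidence and correctness was computed. As a rough illustration only, one common approach groups answers by stated confidence and correlates the mean confidence of each group with its accuracy. The function names, the bin count, and the toy data below are all hypothetical and are not taken from the paper.

```python
import math

def binned_calibration(confidences, correct, n_bins=5):
    """Return (mean confidence, accuracy) for each non-empty confidence bin.

    confidences: floats in [0, 1]; correct: 1 for a right answer, 0 otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    points = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            points.append((mean_conf, accuracy))
    return points

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

# Hypothetical model outputs: stated confidence and whether the answer was right.
toy_confidences = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9]
toy_correct = [0, 0, 1, 0, 1, 1]

toy_points = binned_calibration(toy_confidences, toy_correct)
toy_r = pearson([p[0] for p in toy_points], [p[1] for p in toy_points])
```

Under this sketch, a model whose per-bin accuracy sits above its mean confidence would count as underconfident.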
INDUS: Effective and Efficient Language Models for Scientific Applications
Bhattacharjee, Bishwaranjan, Trivedi, Aashka, Muraoka, Masayasu, Ramasubramanian, Muthukumaran, Udagawa, Takuma, Gurung, Iksha, Zhang, Rong, Dandala, Bharath, Ramachandran, Rahul, Maskey, Manil, Bugbee, Kaylin, Little, Mike, Fancher, Elizabeth, Sanders, Lauren, Costes, Sylvain, Blanco-Cuaresma, Sergi, Lockhart, Kelly, Allen, Thomas, Grezes, Felix, Ansdell, Megan, Accomazzi, Alberto, El-Kurdi, Yousef, Wertheimer, Davis, Pfitzmann, Birgit, Ramis, Cesar Berrospi, Dolfi, Michele, de Lima, Rafael Teixeira, Vagenas, Panagiotis, Mukkavilli, S. Karthik, Staar, Peter, Vahidinia, Sanaz, McGranaghan, Ryan, Mehrabian, Armin, Lee, Tsendgar
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained using a diverse set of datasets drawn from multiple sources to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation techniques to address applications which have latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE-NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as existing benchmark tasks in the domains of interest.
Experimenting with Large Language Models and vector embeddings in NASA SciX
Blanco-Cuaresma, Sergi, Ciucă, Ioana, Accomazzi, Alberto, Kurtz, Michael J., Henneken, Edwin A., Lockhart, Kelly E., Grezes, Felix, Allen, Thomas, Shapurian, Golnaz, Grant, Carolyn S., Thompson, Donna M., Hostetler, Timothy W., Templeton, Matthew R., Chen, Shinyi, Koch, Jennifer, Jacovich, Taylor, Chivvis, Daniel, Alves, Fernanda de Macedo, Paquin, Jean-Claude, Bartlett, Jennifer, Polimera, Mugdha, Jarmak, Stephanie
However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed an experiment where we created semantic vectors for our large collection of abstracts and full-text content, and we designed a prompt system to ask questions using contextual chunks from our system. Based on a non-systematic human evaluation, the experiment shows a lower degree of hallucination and better responses when using Retrieval Augmented Generation. Further exploration is required to design new features and data augmentation processes at NASA SciX that leverage this technology while respecting the high level of trust and quality that the project holds.
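The retrieval step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the NASA SciX implementation: the chunk texts, the three-dimensional vectors, the prompt wording, and the function names are all made up for the example. A real system would get its vectors from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Assemble a prompt that asks the model to answer from the retrieved context."""
    context = "\n".join(f"- {c['text']}" for c in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical pre-computed chunk embeddings (toy 3-dimensional vectors).
TOY_CHUNKS = [
    {"text": "Chunk about stellar spectra.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Chunk about galaxy surveys.", "vector": [0.1, 0.9, 0.0]},
    {"text": "Chunk about instrument calibration.", "vector": [0.0, 0.1, 0.9]},
]

toy_query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded user question
toy_prompt = build_prompt(
    "What do stellar spectra show?", retrieve(toy_query_vec, TOY_CHUNKS, k=2)
)
```

The assembled prompt would then be sent to the language model, so its answer is grounded in retrieved text rather than in the model's parameters alone.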
Identifying Planetary Names in Astronomy Papers: A Multi-Step Approach
Shapurian, Golnaz, Kurtz, Michael J, Accomazzi, Alberto
The automatic identification of planetary feature names in astronomy publications presents numerous challenges. These features include craters, defined as roughly circular depressions resulting from impact or volcanic activity; dorsa, which are elongate raised structures or wrinkle ridges; and lacus, small irregular patches of dark, smooth material on the Moon, referred to as "lake" (Planetary Names Working Group, n.d.). Many feature names overlap with places or people's names that they are named after, for example, Syria, Tempe, Einstein, and Sagan, to name a few (U.S. Geological Survey, n.d.). Some feature names have been used in many contexts, for instance, Apollo, which can refer to mission, program, sample, astronaut, seismic, seismometers, core, era, data, collection, instrument, and station, in addition to the crater on the Moon. Some feature names can appear in the text as adjectives, like the lunar craters Black, Green, and White. Some feature names in other contexts serve as directions, like craters West and South on the Moon. Additionally, some features share identical names across different celestial bodies, requiring disambiguation, such as the Adams crater, which exists on both the Moon and Mars. We present a multi-step pipeline combining rule-based filtering, statistical relevance analysis, part-of-speech (POS) tagging, a named entity recognition (NER) model, hybrid keyword harvesting, knowledge graph (KG) matching, and inference with a locally installed large language model (LLM) to reliably identify planetary names despite these challenges. When evaluated on a dataset of astronomy papers from the Astrophysics Data System (ADS), this methodology achieves an F1-score over 0.97 in disambiguating planetary feature names.
AstroLLaMA: Towards Specialized Foundation Models in Astronomy
Nguyen, Tuan Dung, Ting, Yuan-Sen, Ciucă, Ioana, O'Neill, Charlie, Sun, Ze-Chang, Jabłońska, Maja, Kruk, Sandor, Perkowski, Ernest, Miller, Jack, Li, Jason, Peek, Josh, Iyer, Kartheik, Różański, Tomasz, Khetarpal, Pranav, Zaman, Sharaf, Brodrick, David, Méndez, Sergio J. Rodríguez, Bui, Thang, Goodman, Alyssa, Accomazzi, Alberto, Naiman, Jill, Cranney, Jesse, Schawinski, Kevin, UniverseTBD
Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than LLaMA-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions and embedding extraction than state-of-the-art foundation models despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
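For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, so a lower value means the model assigns higher probability to the text. The sketch below shows that definition only. The per-token probabilities are invented for illustration and do not come from AstroLLaMA or LLaMA-2, and the function names are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of a text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def relative_reduction(base_ppl, adapted_ppl):
    """Fractional drop in perplexity of the adapted model relative to the base."""
    return (base_ppl - adapted_ppl) / base_ppl

# Hypothetical log-probabilities that two models assign to the same four tokens.
toy_base_log_probs = [math.log(0.25)] * 4
toy_adapted_log_probs = [math.log(0.5)] * 4

toy_base_ppl = perplexity(toy_base_log_probs)
toy_adapted_ppl = perplexity(toy_adapted_log_probs)
toy_drop = relative_reduction(toy_base_ppl, toy_adapted_ppl)
```

A value of 0.3 from `relative_reduction` would correspond to the "30% lower perplexity" phrasing used in the abstract.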
Improving astroBERT using Semantic Textual Similarity
Grezes, Felix, Allen, Thomas, Blanco-Cuaresma, Sergi, Accomazzi, Alberto, Kurtz, Michael J., Shapurian, Golnaz, Henneken, Edwin, Grant, Carolyn S., Thompson, Donna M., Hostetler, Timothy W., Templeton, Matthew R., Lockhart, Kelly E., Chen, Shinyi, Koch, Jennifer, Jacovich, Taylor, Protopapas, Pavlos
The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we: 1. announce the first public release of the astroBERT language model; 2. show how astroBERT improves over existing public language models on astrophysics-specific tasks; and 3. detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.
Multilingual Topic Models
Krstovski, Kriste, Kurtz, Michael J., Smith, David A., Accomazzi, Alberto
Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. Various document representation schemes possess different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview on latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides means to discover where different concept vocabularies require improvement.