AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets

Ernest Perkowski, Rui Pan, Tuan Dung Nguyen, Yuan-Sen Ting, Sandor Kruk, Tong Zhang, Charlie O'Neill, Maja Jablonska, Zechang Sun, Michael J. Smith, Huiling Liu, Kevin Schawinski, Kartheik Iyer, Ioana Ciucă, for UniverseTBD

arXiv.org Artificial Intelligence 

We introduce AstroLLaMA-Chat, an advanced version of AstroLLaMA. This new iteration broadens the training scope beyond abstracts to include the introductions and conclusions of papers, as these sections are often rich in information pivotal for question-answering tasks. We began by downloading all papers up to July 2023, including every file accompanying each arXiv submission. The data were then filtered for operability, retaining only files with the ".tex" suffix. The targeted sections were extracted through a multi-stage process using comprehensive regex matching. Given the diversity of LaTeX formatting conventions, approximately 90% of the samples survived this processing. We then removed specific formatting patterns, comments, and superfluous symbols such as newlines to ensure the readability of the training data. Finally, we fine-tuned AstroLLaMA-Chat on a domain-specific dialogue dataset. To generate question-answer pairs, we engaged GPT-4 (OpenAI 2023) to formulate pertinent questions from paragraphs within 300,000 arXiv papers, with GPT-4 also tasked with answering these questions by retrieving context-relevant information.
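The paper does not release its extraction code, but the regex-based pipeline it describes (pulling named sections out of ".tex" sources, then stripping comments and superfluous whitespace) might be sketched roughly as follows. All function names and patterns here are hypothetical illustrations, not the authors' implementation; real arXiv sources vary enough in LaTeX conventions that a production pipeline would need many more patterns, which is consistent with the reported ~90% retention rate.

```python
import re

def extract_section(tex, name):
    """Return the body of a named \\section{...} from LaTeX source, or None.

    Hypothetical sketch: matches the section header, then captures text
    lazily up to the next \\section, \\end{document}, or end of string.
    """
    pattern = re.compile(
        r"\\section\*?\{" + re.escape(name) + r"\}"
        r"(.*?)(?=\\section|\\end\{document\}|\Z)",
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(tex)
    return match.group(1).strip() if match else None

def clean_tex(body):
    """Strip LaTeX comments, a few common commands, and excess whitespace."""
    body = re.sub(r"(?<!\\)%.*", "", body)                     # % comments
    body = re.sub(r"\\(?:label|cite[tp]?)\{[^}]*\}", "", body)  # labels/citations
    body = re.sub(r"\s+", " ", body)                            # collapse newlines
    return body.strip()
```

For example, applying `clean_tex(extract_section(source, "Introduction"))` to a paper's source would yield a single readable paragraph of introduction text, suitable for the training corpus described above.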