earth science
Using AI to speed up landslide detection
On 3 April 2024, a magnitude 7.4 earthquake, Taiwan's strongest in 25 years, shook the country's eastern coast. Stringent building codes spared most structures, but remote mountain villages were devastated by landslides. When disasters affect large and inaccessible areas, responders often turn to satellite images to pinpoint the damage and prioritise relief efforts. But mapping landslides from satellite imagery by eye can be time-intensive, said Lorenzo Nava, who is jointly based at Cambridge's Departments of Earth Sciences and Geography. "In the aftermath of a disaster, time really matters," he said.
EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs
Xu, Wanghan, Zhao, Xiangyu, Zhou, Yuhao, Yue, Xiaoyu, Fei, Ben, Ling, Fenghua, Zhang, Wenlong, Bai, Lei
Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for fields such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available at https://huggingface.co/ai-earth.
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Hong Kong (0.04)
- Africa > North Africa (0.04)
- (4 more...)
- Overview (1.00)
- Research Report > Promising Solution (0.46)
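Benchmarks like Earth-Iron and Earth-Silver score models across many task categories. A minimal sketch of that kind of per-category scoring follows; the record layout (`category`, `gold`, `predicted` keys) is an invented assumption for illustration, not the benchmark's actual schema.

```python
# Hedged sketch: scoring a model's answers on a multiple-choice QA benchmark,
# broken down by task category. The record layout is assumed, not EarthSE's.
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of dicts with 'category', 'gold', 'predicted' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if r["predicted"] == r["gold"]:
            correct[r["category"]] += 1
    return {c: correct[c] / total[c] for c in total}

sample = [
    {"category": "atmosphere", "gold": "B", "predicted": "B"},
    {"category": "atmosphere", "gold": "C", "predicted": "A"},
    {"category": "hydrosphere", "gold": "D", "predicted": "D"},
]
print(accuracy_by_category(sample))  # {'atmosphere': 0.5, 'hydrosphere': 1.0}
```

Per-category breakdowns like this are what let a benchmark expose uneven model performance across the five Earth spheres and 11 task categories rather than reporting a single aggregate score.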
Terra: A Multimodal Spatio-Temporal Dataset Spanning the Earth
Since the inception of our planet, the meteorological environment, as reflected in spatio-temporal data, has been a fundamental factor influencing human life, socio-economic progress, and ecological conservation. A comprehensive exploration of this data is thus imperative for a deeper understanding and more accurate forecasting of these environmental shifts. Despite the success of deep learning techniques for spatio-temporal data and Earth science, existing public datasets are beset by limitations in spatial scale and temporal coverage and by their reliance on limited time-series data. These constraints hinder their optimal utilization in practical applications. To address these issues, we introduce Terra, a multimodal spatio-temporal dataset spanning the Earth.
GeoGalactica: A Scientific Large Language Model in Geoscience
Lin, Zhouhan, Deng, Cheng, Zhou, Le, Zhang, Tianhang, Xu, Yi, Xu, Yutong, He, Zhongmou, Shi, Yuanyuan, Dai, Beiya, Song, Yunchong, Zeng, Boyi, Chen, Qiyuan, Shi, Tao, Huang, Tianyu, Xu, Yiwei, Wang, Shu, Fu, Luoyi, Zhang, Weinan, He, Junxian, Ma, Chao, Zhu, Yunqiang, Wang, Xinbing, Zhou, Chenghu
Large language models (LLMs) have achieved huge success thanks to their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP). Owing to these impressive abilities, LLMs show promise for interdisciplinary applications that foster scientific discovery in specific domains (AI for science, AI4S). Meanwhile, the use of NLP techniques in geoscience research and practice is broad and varied, spanning tasks from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach: we specialize an LLM for geoscience by further pre-training the model on a vast amount of geoscience text, then applying supervised fine-tuning (SFT) with our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters that is, to the best of our knowledge, the largest language model for the geoscience domain. More specifically, GeoGalactica is obtained by further pre-training Galactica on a geoscience-related text corpus of 65 billion tokens curated from extensive data sources in the big science project Deep-time Digital Earth (DDE), which stands as the largest geoscience-specific text corpus. We then fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the GeoGalactica checkpoints from the first 3/4 of pre-training.
- North America > United States (0.93)
- Asia > Middle East (0.27)
- Europe (0.14)
- Asia > China > Sichuan Province (0.14)
- Materials (1.00)
- Law (1.00)
- Information Technology (1.00)
- (5 more...)
Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue
Zhang, Shiwei, Wu, Mingfang, Zhang, Xiuzhen
In support of open and reproducible research, a rapidly increasing number of datasets have been made available for research. As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them. Yet it is a common issue that datasets lack quality metadata due to limited resources for data curation. Meanwhile, technologies such as artificial intelligence and large language models (LLMs) are progressing rapidly. Recently, systems based on these technologies, such as ChatGPT, have demonstrated promising capabilities for certain data curation tasks. This paper proposes to leverage LLMs for cost-effective annotation of subject metadata through LLM-based in-context learning. Our method employs GPT-3.5 with prompts designed for annotating subject metadata, demonstrating promising performance in automatic metadata annotation. However, models based on in-context learning cannot acquire discipline-specific rules, resulting in lower performance in several categories. This limitation arises from the limited contextual information available for subject inference. To the best of our knowledge, this is the first in-context learning method that harnesses large language models for automated subject metadata annotation.
- Oceania > New Zealand (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Oceania > Australia > Tasmania (0.04)
- (3 more...)
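The in-context learning setup described above amounts to building a few-shot prompt and parsing the model's completion as a subject label. A minimal sketch follows; the few-shot examples and subject names are invented for illustration and do not reproduce the study's actual prompts or the subject scheme it annotated against.

```python
# Hedged sketch of in-context learning for subject metadata annotation.
# FEW_SHOT pairs and subject labels are hypothetical stand-ins.
FEW_SHOT = [
    ("Long-term rainfall records for the Murray-Darling Basin", "Hydrology"),
    ("Genome assemblies of Tasmanian devil populations", "Genetics"),
]

def build_prompt(title: str) -> str:
    lines = ["Assign one subject category to each dataset title.", ""]
    for example_title, subject in FEW_SHOT:
        lines.append(f"Title: {example_title}\nSubject: {subject}\n")
    lines.append(f"Title: {title}\nSubject:")
    return "\n".join(lines)

prompt = build_prompt("Soil moisture measurements across Victorian farmland")
# This prompt would then be sent to a chat-completion API such as GPT-3.5;
# the model's reply is parsed as the predicted subject label.
print(prompt)
```

Because the model only sees the handful of examples in the prompt, it has no way to learn discipline-specific annotation rules, which is the limitation the abstract notes for several categories.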
IBM and NASA teamed up to build the GPT of Earth sciences
NASA estimates that its Earth science missions will generate around a quarter million terabytes of data in 2024 alone. To help climate scientists and the research community efficiently dig through these reams of raw satellite data, IBM, Hugging Face and NASA have collaborated to build an open-source geospatial foundation model that will serve as the basis for a new class of climate and Earth science AIs that can track deforestation, predict crop yields and track greenhouse gas emissions. For this project, IBM leveraged its recently released Watsonx.ai to serve as the foundation model, using a year's worth of NASA's Harmonized Landsat Sentinel-2 (HLS) satellite data. That data is collected by the ESA's pair of Sentinel-2 satellites, which are built to acquire high-resolution optical imagery over land and coastal regions in 13 spectral bands. For its part, Hugging Face is hosting the model on its open-source AI platform.
- Government > Space Agency (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
ARSET - Fundamentals of Machine Learning for Earth Science
Artificial intelligence and machine learning have grown in popularity in recent decades as a result of advances in high-performance computing and open-source software. At its core, machine learning provides statistical inference based on user-supplied inputs, in which algorithms learn relationships between input data and output results. The complexity of these algorithms allows for the discovery of patterns and trends invisible to the human analyst, making it important to create analysis-appropriate input for these models to ensure that they answer the questions we are asking. This training will provide attendees with an overview of machine learning as it relates to Earth science, and of how to apply these algorithms and techniques to remote sensing data in a meaningful way. Attendees will also be provided with end-to-end case study examples for generating a simple random forest model for land cover classification from optical remote sensing data.
- Government > Space Agency (0.91)
- Government > Regional Government > North America Government > United States Government (0.91)
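The random-forest case study mentioned above can be sketched in a few lines: a classifier learns land-cover classes from per-pixel spectral bands. A real training would use labeled satellite imagery; here the six "bands" and three classes are synthetic stand-ins, so this is a minimal illustration rather than the ARSET exercise itself.

```python
# Hedged sketch: random forest for land-cover classification on synthetic
# per-pixel spectral data. Band values and class labels are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pixels, n_bands = 600, 6
# Three synthetic land-cover classes with different mean spectra.
X = np.vstack([rng.normal(loc=mu, scale=0.3, size=(n_pixels // 3, n_bands))
               for mu in (0.2, 0.5, 0.8)])
y = np.repeat([0, 1, 2], n_pixels // 3)  # e.g. 0=water, 1=vegetation, 2=urban

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

With real imagery, each row of `X` would hold the band reflectances of one pixel and `y` would come from ground-truth land-cover labels; the workflow is otherwise the same.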
Quantum-inspired tensor network for Earth science
Otgonbaatar, Soronzonbold, Kranzlmüller, Dieter
Deep Learning (DL) is one of many successful methodologies for extracting informative patterns and insights from ever-increasing noisy large-scale datasets (in our case, satellite images). However, DL models consist of a few thousand to millions of training parameters, and these parameters require a tremendous amount of electrical power to extract informative patterns from noisy large-scale datasets (i.e., they are computationally expensive). Hence, we employ a quantum-inspired tensor network for compressing the trainable parameters of physics-informed neural networks (PINNs) in Earth science. PINNs are DL models penalized by enforcing the laws of physics; in particular, the laws of physics are embedded in the DL models. In addition, we apply tensor decomposition to HyperSpectral Images (HSIs) to improve their spectral resolution. A quantum-inspired tensor network is also the native formulation for efficiently representing and training quantum machine learning models on big datasets using GPU tensor cores. The key contribution of this paper is twofold: (1) we reduce the number of trainable parameters of PINNs by using a quantum-inspired tensor network, and (2) we improve the spectral resolution of remotely-sensed images by employing tensor decomposition. As a benchmark PDE, we solve Burgers' equation. As practical satellite data, we employ the Indian Pines (USA) and Pavia University (Italy) HSIs.
- North America > United States (0.26)
- Europe > Italy (0.26)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
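The parameter-compression idea above can be illustrated in its simplest form: replace a dense weight matrix with a low-rank factorization. Tensor-train networks, as used in the paper, generalize this to higher-order tensors; the truncated SVD below is a two-factor stand-in for illustration, not the paper's actual method.

```python
# Hedged sketch: low-rank compression of a dense layer's weights, the
# matrix-shaped special case of tensor-network parameter compression.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))  # a dense layer's weight matrix

rank = 16
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]      # 256 x 16 factor (singular values folded in)
B = Vt[:rank, :]                # 16 x 256 factor

original = W.size
compressed = A.size + B.size
print(f"parameters: {original} -> {compressed} "
      f"({compressed / original:.1%} of original)")
# The layer now applies x @ A @ B instead of x @ W, trading approximation
# error for an 8x reduction in trainable parameters.
```

In a PINN this trade-off matters because the physics-based penalty constrains the solution, so a heavily compressed parameterization can often still satisfy the governing equations.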
Robust Causality and False Attribution in Data-Driven Earth Science Discoveries
Eldhose, Elizabeth, Chauhan, Tejasvi, Chandel, Vikram, Ghosh, Subimal, Ganguly, Auroop R.
Causal and attribution studies are essential for Earth scientific discoveries and critical for informing climate, ecology, and water policies. However, the current generation of methods has not kept pace with the complexity of scientific and stakeholder challenges, the growth in data availability, or the adequacy of data-driven methods. Unless carefully informed by physics, these methods run the risk of conflating correlation with causation or being overwhelmed by estimation inaccuracies. Given that natural experiments, controlled trials, interventions, and counterfactual examinations are often impractical, information-theoretic methods have been developed and are being continually refined in the Earth sciences. Here we show that transfer entropy-based causal graphs, which have recently become popular in the Earth sciences with high-profile discoveries, can be spurious even when augmented with statistical significance tests. We develop a subsample-based ensemble approach for robust causality analysis. Simulated data and observations in climate and ecohydrology suggest the robustness and consistency of this approach.
- Asia > India > Maharashtra > Mumbai (0.04)
- North America > United States > Washington > Benton County > Richland (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (6 more...)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
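The subsample-based ensemble idea above can be sketched simply: estimate a directed dependence measure on many subsamples of the series and accept a causal edge only if it is detected consistently. In this sketch, lagged correlation stands in for transfer entropy (the paper's actual measure), and the windowing scheme and threshold are illustrative assumptions.

```python
# Hedged sketch: subsample-ensemble test for a directed edge x -> y.
# Lagged correlation is a crude proxy for transfer entropy.
import numpy as np

def lagged_dependence(x, y, lag=1):
    """|corr(x[t-lag], y[t])| as a directed-dependence proxy."""
    return abs(np.corrcoef(x[:-lag], y[lag:])[0, 1])

def ensemble_edge(x, y, n_subsamples=200, frac=0.7, threshold=0.2, seed=0):
    """Fraction of contiguous subsamples in which the edge is detected."""
    rng = np.random.default_rng(seed)
    n = len(x)
    hits = 0
    for _ in range(n_subsamples):
        start = rng.integers(0, int(n * (1 - frac)))
        end = start + int(n * frac)
        if lagged_dependence(x[start:end], y[start:end]) > threshold:
            hits += 1
    return hits / n_subsamples

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = np.roll(x, 1) + 0.5 * rng.normal(size=2000)  # y driven by lagged x
print(ensemble_edge(x, y))  # near 1.0: the edge x -> y is robust
print(ensemble_edge(y, x))  # near 0.0: the reverse edge is not supported
```

An edge that appears in only a fraction of subsamples is the kind of fragile link that a single full-sample estimate, even with a significance test, can mistake for a robust causal discovery.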