data product
Connecting the Dots: A Machine Learning Ready Dataset for Ionospheric Forecasting Models
Wolniewicz, Linnea M., Kelebek, Halil S., Mestici, Simone, Vergalla, Michael D., Acciarini, Giacomo, Poduval, Bala, Verkhoglyadova, Olga, Guhathakurta, Madhulika, Berger, Thomas E., Baydin, Atılım Güneş, Soboczenski, Frank
Operational forecasting of the ionosphere remains a critical space weather challenge due to sparse observations, complex coupling across geospatial layers, and a growing need for timely, accurate predictions that support Global Navigation Satellite System (GNSS) services, communications, aviation safety, and satellite operations. As part of the 2025 NASA Heliolab, we present a curated, open-access dataset that integrates diverse ionospheric and heliospheric measurements into a coherent, machine-learning-ready structure, designed specifically to support next-generation forecasting models and address gaps in current operational frameworks. Our workflow integrates a large selection of data sources comprising Solar Dynamics Observatory data, solar irradiance indices (F10.7), solar wind parameters (velocity and interplanetary magnetic field), geomagnetic activity indices (Kp, AE, SYM-H), and NASA JPL's Global Ionospheric Maps of Total Electron Content (GIM-TEC). We also incorporate geospatially sparse data, such as TEC derived from the World-Wide GNSS Receiver Network and crowdsourced Android smartphone measurements. This novel heterogeneous dataset is temporally and spatially aligned into a single, modular data structure that supports both physical and data-driven modeling. Leveraging this dataset, we train and benchmark several spatiotemporal machine learning architectures for forecasting vertical TEC under both quiet and geomagnetically active conditions. This work presents an extensive dataset and modeling pipeline that enables exploration of not only ionospheric dynamics but also broader Sun-Earth interactions, supporting both scientific inquiry and operational forecasting efforts.
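The core of the pipeline is aligning sources with very different cadences (daily F10.7, 3-hourly Kp, 2-hourly GIM-TEC maps, and so on) onto one time grid. Below is a minimal sketch of that alignment step in pandas; the file names, column layouts, and hourly target grid are illustrative assumptions, not the dataset's actual format.

```python
# Hedged sketch: align heterogeneous space-weather series onto one hourly
# grid. File names and cadences below are hypothetical placeholders.
import pandas as pd

f107 = pd.read_csv("f107_daily.csv", index_col="time", parse_dates=True)
kp = pd.read_csv("kp_3hourly.csv", index_col="time", parse_dates=True)
tec = pd.read_csv("gim_tec_2hourly.csv", index_col="time", parse_dates=True)

# Common hourly grid spanning the overlap of all sources.
start = max(s.index.min() for s in (f107, kp, tec))
end = min(s.index.max() for s in (f107, kp, tec))
grid = pd.date_range(start, end, freq="1h")

aligned = pd.concat(
    [
        f107.reindex(grid, method="ffill"),  # slowly varying driver: hold last value
        kp.reindex(grid, method="ffill"),    # 3-hourly index: hold last value
        # TEC: linear-in-time interpolation between 2-hourly maps.
        tec.reindex(tec.index.union(grid)).interpolate("time").reindex(grid),
    ],
    axis=1,
)
aligned.to_parquet("aligned_hourly.parquet")  # one modular, ML-ready table
```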
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.15)
- North America > United States > New Hampshire (0.04)
- North America > United States > Hawaii (0.04)
- (5 more...)
- Workflow (0.54)
- Research Report (0.40)
- Government > Space Agency (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
How to Sell High-Dimensional Data Optimally
Li, Andrew, Ravi, R., Singh, Karan, Yi, Zihong, Zhang, Weizhong
Motivated by the problem of selling large, proprietary data, we consider an information pricing problem proposed by Bergemann et al. that involves a decision-making buyer and a monopolistic seller. The seller has access to the underlying state of the world that determines the utility of the various actions the buyer may take. Since the buyer gains greater utility through better decisions resulting from more accurate assessments of the state, the seller can therefore promise the buyer supplemental information at a price. To contend with the fact that the seller may not be perfectly informed about the buyer's private preferences (or utility), we frame the problem of designing a data product as one where the seller designs a revenue-maximizing menu of statistical experiments. Prior work by Cai et al. showed that an optimal menu can be found in time polynomial in the state space, whereas we observe that the state space is naturally exponential in the dimension of the data. We propose an algorithm which, given only sampling access to the state space, provably generates a near-optimal menu with a number of samples independent of the state space. We then analyze a special case of high-dimensional Gaussian data, showing that (a) it suffices to consider scalar Gaussian experiments, (b) the optimal menu of such experiments can be found efficiently via a semidefinite program, and (c) full surplus extraction occurs if and only if a natural separation condition holds on the set of potential preferences of the buyer.
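To make the pricing setup concrete, the toy computation below prices a fully informative experiment for a single known buyer type on a two-state, two-action problem. It illustrates only the buyer's value of information; it is not the paper's sampling-based menu algorithm or its semidefinite program for Gaussian data.

```python
# Toy value-of-information computation on a small discrete state space.
import numpy as np

prior = np.array([0.6, 0.4])             # P(state)
# utility[action, state] for a two-action decision problem
utility = np.array([[1.0, -1.0],
                    [-0.5, 0.5]])

# Without extra information: best single action under the prior.
u_prior = max(utility @ prior)

# With a fully informative experiment: best action per revealed state.
u_full = prior @ utility.max(axis=0)

# A monopolist facing a known buyer type can price the experiment at the
# buyer's value of information, extracting the full surplus.
print("value of information:", u_full - u_prior)
```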
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Kosovo > District of Gjilan > Kamenica (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > California (0.04)
The Landform Contextual Mesh: Automatically Fusing Surface and Orbital Terrain for Mars 2020
The Landform contextual mesh fuses 2D and 3D data from up to thousands of Mars 2020 rover images, along with orbital elevation and color maps from the Mars Reconnaissance Orbiter, into an interactive 3D terrain visualization. Contextual meshes are built automatically for each rover location during mission ground data system processing and are made available to mission scientists for tactical and strategic planning in the Advanced Science Targeting Tool for Robotic Operations (ASTTRO). A subset of them is also deployed to the "Explore with Perseverance" public access website.
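One way to picture the surface/orbital fusion step is blending a detailed but partial rover-derived elevation grid into a complete orbital one. The sketch below does this with a feathered weight; it assumes grids that are already co-registered at a common resolution and is a conceptual illustration, not the Landform implementation.

```python
# Hedged sketch: feathered blending of a partial rover DEM into an
# orbital DEM. Assumes both grids are co-registered at one resolution.
import numpy as np
from scipy.ndimage import distance_transform_edt

def fuse_dems(rover_dem, orbital_dem, valid_mask, feather=20):
    """Blend a detailed but partial rover DEM into a complete orbital DEM.

    rover_dem, orbital_dem : 2D arrays on the same grid (meters)
    valid_mask : bool array, True where the rover DEM has data
    feather : blend width in pixels at the rover-coverage boundary
    """
    # Pixel distance from each rover-covered cell to the nearest gap;
    # zero outside rover coverage.
    dist = distance_transform_edt(valid_mask)
    # Weight ramps from 0 at the coverage boundary to 1 deep inside it.
    w = np.clip(dist / feather, 0.0, 1.0)
    return w * np.where(valid_mask, rover_dem, 0.0) + (1.0 - w) * orbital_dem
```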
- North America > United States > California > Los Angeles County > Pasadena (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Japan (0.04)
Secure, Scalable and Privacy Aware Data Strategy in Cloud
Butte, Vijay Kumar, Butte, Sujata
Enterprises today face the tough challenge of processing and storing large amounts of data in a secure, scalable manner while enabling decision makers to make quick, informed, data-driven decisions. This paper addresses this challenge and develops an effective enterprise data strategy in the cloud. We discuss the components of an effective data strategy and provide architectures that address security, scalability, and privacy.
Lightweight posterior construction for gravitational-wave catalogs with the Kolmogorov-Arnold network
Liu, Wenshuai, Dong, Yiming, Wang, Ziming, Shao, Lijing
Neural density estimation has seen widespread application in gravitational-wave (GW) data analysis, enabling real-time parameter estimation for compact binary coalescences and rapid inference for subsequent analyses such as population inference. In this work, we explore the use of the Kolmogorov-Arnold network (KAN) to construct efficient and interpretable neural density estimators for lightweight posterior construction of GW catalogs. By replacing conventional activation functions with learnable splines, KAN achieves superior interpretability, higher accuracy, and greater parameter efficiency on related scientific tasks. Leveraging this feature, we propose a KAN-based neural density estimator, which ingests megabyte-scale GW posterior samples and compresses them into model weights of tens of kilobytes. Analytic expressions requiring only several kilobytes can then be distilled from these network weights with minimal accuracy trade-off. In practice, high-fidelity GW posterior samples can be rapidly regenerated from the model weights or analytic expressions for subsequent analysis. Our lightweight posterior construction strategy is expected to facilitate user-level data storage and transmission, paving a path toward efficient analysis of the numerous GW events expected from next-generation GW detectors.
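For intuition on the architecture, the sketch below implements a KAN-style layer in PyTorch in which each input-output edge carries a learnable 1D function. Radial basis functions stand in for the B-splines of the original KAN, and nothing here is the authors' exact estimator.

```python
# Hedged sketch of a KAN-style layer: every edge applies a learnable 1D
# function, parameterized here by RBFs rather than B-splines for brevity.
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_basis=16, x_range=(-3.0, 3.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*x_range, n_basis))
        self.width = (x_range[1] - x_range[0]) / n_basis
        # One coefficient vector per (input, output) edge.
        self.coef = nn.Parameter(torch.zeros(in_dim, out_dim, n_basis))

    def forward(self, x):                       # x: (batch, in_dim)
        # Evaluate the RBF basis at each input coordinate.
        phi = torch.exp(-((x[..., None] - self.centers) / self.width) ** 2)
        # Sum the learnable edge functions over inputs -> (batch, out_dim).
        return torch.einsum("bif,iof->bo", phi, self.coef)

# A density estimator built from such layers can be fit to posterior
# samples (e.g., inside a normalizing-flow wrapper) and then serialized
# at kilobyte scale, which is the compression idea described above.
```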
- Asia > China > Beijing > Beijing (0.04)
- Asia > Japan (0.04)
- Europe > United Kingdom (0.04)
- (10 more...)
Challenges and Research Directions from the Operational Use of a Machine Learning Damage Assessment System via Small Uncrewed Aerial Systems at Hurricanes Debby and Helene
Manzini, Thomas, Perali, Priyankari, Murphy, Robin R., Merrick, David
This paper details four principal challenges encountered with machine learning (ML) damage assessment using small uncrewed aerial systems (sUAS) at Hurricanes Debby and Helene that prevented, degraded, or delayed the delivery of data products during operations, and suggests three research directions for future real-world deployments. The presence of these challenges is not surprising, given that a review of the literature covering both datasets and proposed ML models suggests this is the first sUAS-based ML system for disaster damage assessment actually deployed as part of real-world operations. The sUAS-based ML system was applied by the State of Florida to Hurricanes Helene (2 orthomosaics, 3.0 gigapixels collected over 2 sorties by a Wingtra WingtraOne sUAS) and Debby (1 orthomosaic, 0.59 gigapixels collected via 1 sortie by a Wingtra WingtraOne sUAS) in Florida. The same model was applied to crewed aerial imagery of inland flood damage resulting from the post-tropical remnants of Hurricane Debby in Pennsylvania (436 orthophotos, 136.5 gigapixels), providing further insights into the advantages and limitations of sUAS for disaster response. The four challenges (variation in the spatial resolution of input imagery, spatial misalignment between imagery and geospatial data, wireless connectivity, and data product format) lead to three recommendations specifying the research needed to improve ML model capabilities to accommodate the wide variation of spatial resolutions used in practice, handle spatial misalignment, and minimize the dependency on wireless connectivity. These recommendations are expected to improve the effective operational use of sUAS and sUAS-based ML damage assessment systems for disaster response.
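As a concrete example of the first challenge, the sketch below resamples an image tile to a common ground sample distance (GSD) before inference. The GSD values are hypothetical, and this is not the deployed system's code.

```python
# Hedged sketch: normalize tiles to a common GSD so a model trained at
# one resolution sees consistent inputs. Values are hypothetical.
import numpy as np
from scipy.ndimage import zoom

def to_common_gsd(tile, native_gsd_cm, target_gsd_cm=5.0):
    """Resample an image tile (H, W, C) so one pixel spans target_gsd_cm."""
    scale = native_gsd_cm / target_gsd_cm
    # Bilinear resampling of the spatial axes only; channels untouched.
    return zoom(tile, (scale, scale, 1), order=1)

# Example: a 2 cm/px sUAS orthomosaic tile resampled for a 5 cm/px model.
tile = np.random.rand(512, 512, 3)
print(to_common_gsd(tile, native_gsd_cm=2.0).shape)  # ~ (205, 205, 3)
```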
- North America > United States > Texas > Brazos County > College Station (0.14)
- Africa > Eswatini > Manzini > Manzini (0.05)
- North America > United States > Pennsylvania > Tioga County (0.04)
- (3 more...)
An AI-Driven Data Mesh Architecture Enhancing Decision-Making in Infrastructure Construction and Public Procurement
Mishra, Saurabh, Shinde, Mahendra, Yadav, Aniket, Ayyub, Bilal, Rao, Anand
Infrastructure construction, often dubbed an "industry of industries," is closely linked with government spending and public procurement, offering significant opportunities for improved efficiency and productivity through better transparency and information access. By leveraging these opportunities, we can achieve notable gains in productivity, cost savings, and broader economic benefits. Our approach introduces an integrated software ecosystem utilizing Data Mesh and Service Mesh architectures. This system includes the largest training dataset for infrastructure and procurement, encompassing over 100 billion tokens, scientific publications, activities, and risk data, all structured by a systematic AI framework. Supported by a Knowledge Graph linked to domain-specific multi-agent tasks and Q&A capabilities, our platform standardizes and ingests diverse data sources, transforming them into structured knowledge. Leveraging large language models (LLMs) and automation, our system revolutionizes data structuring and knowledge creation, aiding decision-making in early-stage project planning, detailed research, market trend analysis, and qualitative assessments. Its web-scalable architecture delivers domain-curated information, enabling AI agents to facilitate reasoning and manage uncertainties, while preparing for future expansions with specialized agents targeting particular challenges. This integration of AI with domain expertise not only boosts efficiency and decision-making in construction and infrastructure but also establishes a framework for enhancing government efficiency and accelerating the transition of traditional industries to digital workflows. This work is poised to significantly influence AI-driven initiatives in this sector and guide best practices in AI Operations.
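The ingestion idea can be pictured as reducing documents to knowledge-graph triples that agents later query. The toy sketch below uses a stub extractor and invented sample text; a real system would prompt an LLM at that step, and none of this reflects the paper's actual platform.

```python
# Toy sketch: documents -> (subject, relation, object) triples -> a
# queryable graph. extract_triples is a stand-in for an LLM extractor;
# the sample corpus is invented for illustration.
from collections import defaultdict

def extract_triples(doc_text):
    """Stand-in extractor: a real system would prompt an LLM here."""
    return [tuple(line.split()) for line in doc_text.splitlines()
            if len(line.split()) == 3]

corpus = ["ProjectA awarded_to ContractorB\nProjectA located_in RegionC"]

graph = defaultdict(list)
for doc in corpus:
    for subj, rel, obj in extract_triples(doc):
        graph[subj].append((rel, obj))

# Agents can now answer questions like "what do we know about ProjectA?"
print(graph["ProjectA"])
```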
- Asia (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- South America > Guyana (0.04)
- (10 more...)
- Workflow (1.00)
- Research Report (1.00)
- Government (1.00)
- Energy (1.00)
- Construction & Engineering (1.00)
- (3 more...)
Free to play: UN Trade and Development's experience with developing its own open-source Retrieval Augmented Generation Large Language Model application
Generative artificial intelligence (AI), and in particular Large Language Models (LLMs), have exploded in popularity and attention since the public release of ChatGPT's Generative Pre-trained Transformer (GPT)-3.5 model in November 2022. Due to the power of these general-purpose models and their ability to communicate in natural language, they can be useful in a range of domains, including the work of official statistics and international organizations. However, with such a novel and seemingly complex technology, it can feel as if generative AI is something that happens to an organization, something that can be talked about but not understood, commented on but not contributed to. Additionally, the costs of adopting and operating proprietary solutions can be both uncertain and high, a barrier for often cost-constrained international organizations. In the face of these challenges, United Nations Trade and Development (UNCTAD), through its Global Crisis Response Group (GCRG), has explored and developed its own open-source Retrieval Augmented Generation (RAG) LLM application. RAG makes LLMs aware of, and more useful for, the organization's domain and work. Developing in-house solutions comes with pros and cons: the pros include cost, flexibility, and fostering institutional knowledge; the cons include the required time and skill investments and gaps in application polish and power. The three libraries developed to produce the app, nlp_pipeline for document processing and statistical analysis, local_rag_llm for running a local RAG LLM, and streamlit_rag for the user interface, are publicly available on PyPI and GitHub with Dockerfiles. A fourth library, local_llm_finetune, is also available for fine-tuning existing LLMs, which can then be used in the application.
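For readers unfamiliar with the pattern, the sketch below shows the generic RAG loop (embed chunks, retrieve by cosine similarity, stuff the top hits into a prompt) in plain NumPy with a stand-in embedder. It is a schematic of the approach only, not the API of nlp_pipeline, local_rag_llm, or streamlit_rag.

```python
# Schematic RAG loop with a stand-in embedder; real applications would
# use a sentence-embedding model and a local LLM for generation.
import numpy as np

def embed(texts):
    """Stand-in embedder returning deterministic pseudo-random vectors."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    return rng.normal(size=(len(texts), 384))

chunks = ["UNCTAD 2023 trade brief ...", "GCRG crisis note ..."]
chunk_vecs = embed(chunks)

def retrieve(query, k=2):
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1)
                             * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

query = "What did the trade brief say about food prices?"
prompt = "Context:\n" + "\n".join(retrieve(query)) + f"\n\nQuestion: {query}"
# prompt would then be sent to a local LLM for grounded generation.
```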
- Europe > Switzerland > Geneva > Geneva (0.04)
- Africa > Southern Africa (0.04)
Empowering Data Mesh with Federated Learning
The evolution of data architecture has seen the rise of data lakes, which aim to resolve data management bottlenecks and promote intelligent decision-making. However, this centralized architecture is limited by the proliferation of data sources and the growing demand for timely analysis and processing. A new data paradigm, Data Mesh, has been proposed to overcome these challenges. Data Mesh treats domains as a first-class concern by distributing data ownership from a central team to the individual data domains, while retaining federated governance to monitor domains and their data products. Many multi-million-dollar organizations, such as PayPal, Netflix, and Zalando, have already transformed their data analysis pipelines based on this new architecture. In this decentralized architecture, where data is preserved locally by each domain team, traditional centralized machine learning is incapable of conducting effective analysis across multiple domains, especially for security-sensitive organizations. To this end, we introduce a pioneering approach that incorporates Federated Learning into Data Mesh. To the best of our knowledge, this is the first open-source applied work representing a critical advancement toward the integration of federated learning methods into the Data Mesh paradigm, underscoring the promising prospects for privacy-preserving and decentralized data analysis strategies within the Data Mesh architecture.
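The integration can be sketched as federated averaging across domain teams: each domain fits the shared model on its own data, and only the weights travel to the federation layer. The toy below uses a linear model and synthetic per-domain data; it illustrates FedAvg generically, not the cited open-source implementation.

```python
# Toy FedAvg over data-mesh domains: raw data never leaves a domain,
# only model weights are averaged. Linear least-squares for simplicity.
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One gradient step of least-squares regression on a domain's data."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

rng = np.random.default_rng(0)
domains = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]

w_global = np.zeros(3)
for _ in range(20):
    # Each domain updates the global model locally on its private data.
    local_ws = [local_step(w_global.copy(), X, y) for X, y in domains]
    # Federated averaging, here with equal domain weights.
    w_global = np.mean(local_ws, axis=0)

print("federated model weights:", w_global)
```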
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
- North America > United States > Connecticut > Fairfield County > Stamford (0.04)
- Research Report > Promising Solution (0.48)
- Overview > Innovation (0.34)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services > e-Commerce Services (0.34)
Fusing Climate Data Products using a Spatially Varying Autoencoder
Johnson, Jacob A., Heaton, Matthew J., Christensen, William F., Warr, Lynsie R., Rupper, Summer B.
Autoencoders are powerful machine learning models used to compress information from multiple data sources. However, autoencoders, like all artificial neural networks, are often unidentifiable and uninterpretable. This research focuses on creating an identifiable and interpretable autoencoder that can be used to fuse climate data products. The proposed autoencoder utilizes a Bayesian statistical framework, allowing for probabilistic interpretations while also varying spatially to capture useful spatial patterns across the various data products. Constraints are placed on the autoencoder as it learns patterns in the data, creating an interpretable consensus that includes the important features of each input. We demonstrate the utility of the autoencoder by combining information from multiple precipitation products in High Mountain Asia.
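One way to read "spatially varying" is that the weights combining the input products into a consensus are themselves functions of location. The PyTorch sketch below encodes that idea, with a softmax constraint as a stand-in for interpretability; it is a conceptual analogue, not the paper's constrained Bayesian model.

```python
# Hedged sketch: an autoencoder whose product-mixing weights are
# predicted from location, so the blend can differ across the map.
import torch
import torch.nn as nn

class SpatialAutoencoder(nn.Module):
    def __init__(self, n_products, loc_dim=2, hidden=32):
        super().__init__()
        # Location -> per-product encoder weights; softmax keeps them
        # nonnegative and summing to one, aiding interpretability.
        self.weight_net = nn.Sequential(
            nn.Linear(loc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_products),
        )
        self.decoder = nn.Linear(1, n_products)

    def forward(self, products, loc):
        # products: (batch, n_products), loc: (batch, 2) lon/lat
        w = torch.softmax(self.weight_net(loc), dim=-1)
        consensus = (w * products).sum(-1, keepdim=True)  # latent "truth"
        return self.decoder(consensus), consensus

# Training on reconstruction error yields a per-location consensus
# precipitation field plus interpretable per-product weights.
```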
- Asia (0.24)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- Atlantic Ocean > North Atlantic Ocean (0.04)
- Research Report (0.82)
- Overview (0.68)