End-to-end data-driven weather prediction

AIHub

A new AI weather prediction system, developed by a team of researchers from the University of Cambridge, can deliver accurate forecasts while using less computing power than current AI and physics-based forecasting systems. The system, Aardvark Weather, has been supported by the Alan Turing Institute, Microsoft Research and the European Centre for Medium-Range Weather Forecasts. It provides a blueprint for a new approach to weather forecasting with the potential to improve current practices. The results are reported in the journal Nature. "Aardvark reimagines current weather prediction methods offering the potential to make weather forecasts faster, cheaper, more flexible and more accurate than ever before, helping to transform weather prediction in both developed and developing countries," said Professor Richard Turner from Cambridge's Department of Engineering, who led the research.


Block Toeplitz Sparse Precision Matrix Estimation for Large-Scale Interval-Valued Time Series Forecasting

arXiv.org Machine Learning

Modeling and forecasting interval-valued time series (ITS) have attracted considerable attention due to their growing presence in various contexts. To the best of our knowledge, there have been no efforts to model large-scale ITS. In this paper, we propose a feature extraction procedure for large-scale ITS, which involves key steps such as auto-segmentation and clustering, and feature transfer learning. This procedure can be seamlessly integrated with any suitable prediction model for forecasting purposes. Specifically, we transform the automatic segmentation and clustering of ITS into the estimation of Toeplitz sparse precision matrices and an assignment set. The majorization-minimization algorithm is employed to convert this highly non-convex optimization problem into two subproblems. We derive an efficient dynamic programming algorithm and an alternating direction method to solve these two subproblems alternately and establish their convergence properties. By employing the Joint Recurrence Plot (JRP) to image each subsequence and assigning a class label to each cluster, an image dataset is constructed. An appropriate neural network is then trained on this image dataset and used to extract features for the subsequent forecasting step. Real data applications demonstrate that the proposed method effectively obtains invariant representations of the raw data and enhances forecasting performance.
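
As a rough illustration of the JRP imaging step described above, the sketch below converts a single interval-valued subsequence, represented here simply by its lower- and upper-bound series, into a joint recurrence image; this representation, the quantile threshold rule, and the subsequence length are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def joint_recurrence_plot(channels, quantile=0.1):
    """JRP: element-wise product of per-channel recurrence plots.

    channels : list of 1-D arrays, e.g. the lower- and upper-bound series
               of an interval-valued subsequence (an assumed representation).
    """
    jrp = None
    for x in channels:
        d = np.abs(x[:, None] - x[None, :])      # pairwise distances in this channel
        eps = np.quantile(d, quantile)           # per-channel recurrence threshold
        r = (d <= eps).astype(np.uint8)          # channel recurrence plot
        jrp = r if jrp is None else jrp * r      # joint recurrence: all channels recur
    return jrp

# Example: image one interval-valued subsequence of length 64.
rng = np.random.default_rng(0)
center = np.cumsum(rng.normal(size=64))
lower, upper = center - 0.5, center + 0.5
image = joint_recurrence_plot([lower, upper])    # 64 x 64 binary image for a CNN
```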


Spatially-Heterogeneous Causal Bayesian Networks for Seismic Multi-Hazard Estimation: A Variational Approach with Gaussian Processes and Normalizing Flows

arXiv.org Machine Learning

Earthquakes cause harm not only through direct ground shaking but also by triggering secondary ground failures such as landslides and liquefaction. These combined effects lead to devastating consequences, including structural damage and human casualties. A striking illustration is the 2021 Haiti earthquake, which initiated over 7,000 landslides covering more than 80 square kilometers. This catastrophic event damaged or destroyed over 130,000 buildings, claimed 2,248 lives, and left more than 12,200 people injured [1]. Rapidly identifying where and how severely ground failures and structural damage have occurred following an earthquake is essential for effective victim rescue operations within the crucial "golden 72-hour" window, and plays a vital role in developing effective post-disaster recovery plans [2, 3]. Over the years, researchers have developed various approaches for estimating the location and intensity of earthquake-induced ground failures and building damage.


Multi-resolution Score-Based Variational Graphical Diffusion for Causal Disaster System Modeling and Inference

arXiv.org Machine Learning

Complex systems with intricate causal dependencies challenge accurate prediction. Effective modeling requires precise physical process representation, integration of interdependent factors, and incorporation of multi-resolution observational data. These systems manifest in both static scenarios with instantaneous causal chains and temporal scenarios with evolving dynamics, complicating modeling efforts. Current methods struggle to simultaneously handle varying resolutions, capture physical relationships, model causal dependencies, and incorporate temporal dynamics, especially with inconsistently sampled data from diverse sources. We introduce Temporal-SVGDM: Score-based Variational Graphical Diffusion Model for Multi-resolution observations. Our framework constructs an individual SDE for each variable at its native resolution, then couples these SDEs through a causal score mechanism in which parent nodes inform child nodes' evolution. This enables unified modeling of both immediate causal effects in static scenarios and evolving dependencies in temporal scenarios. In temporal models, state representations are processed through a sequence prediction model to predict future states based on historical patterns and causal relationships. Experiments on real-world datasets demonstrate improved prediction accuracy and causal understanding compared to existing methods, with robust performance under varying levels of background knowledge. Our model exhibits graceful degradation across different disaster types, successfully handling both static earthquake scenarios and temporal hurricane and wildfire scenarios, while maintaining superior performance even with limited data.
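
The following toy sketch illustrates the general idea of coupling SDEs so that a parent variable's state enters a child variable's drift, in the spirit of the causal score mechanism described above; the dynamics, coefficients, and Euler-Maruyama discretization are assumptions for illustration, not the authors' model.

```python
# Illustrative sketch (not the authors' implementation): a parent SDE evolves
# autonomously while the child's drift is informed by the parent's current state.
import numpy as np

def euler_maruyama_coupled(T=1.0, n_steps=500, sigma_p=0.5, sigma_c=0.5, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    parent = np.zeros(n_steps + 1)
    child = np.zeros(n_steps + 1)
    for t in range(n_steps):
        dw_p, dw_c = rng.normal(scale=np.sqrt(dt), size=2)
        # Parent: simple mean-reverting drift plus noise.
        parent[t + 1] = parent[t] - parent[t] * dt + sigma_p * dw_p
        # Child: drift pulls toward the parent's state (causal coupling).
        child[t + 1] = child[t] + (parent[t] - child[t]) * dt + sigma_c * dw_c
    return parent, child

parent_path, child_path = euler_maruyama_coupled()
```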


Computational Efficient Informative Nonignorable Matrix Completion: A Row- and Column-Wise Matrix U-Statistic Pseudo-Likelihood Approach

arXiv.org Machine Learning

In this study, we establish a unified framework for the high-dimensional matrix completion problem under flexible nonignorable missing mechanisms. Although matrix completion has attracted much attention over the years, very few works consider a nonignorable missing mechanism. To address this problem, we derive a row- and column-wise matrix U-statistic type loss function, with the nuclear norm for regularization. A singular value proximal gradient algorithm is developed to solve the proposed optimization problem. We prove a non-asymptotic upper bound on the Frobenius norm of the estimation error and demonstrate the performance of our method through numerical simulations and real data analysis.
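
For readers unfamiliar with singular value proximal gradient methods, the sketch below shows the generic proximal step, soft-thresholding of singular values, applied to a plain squared-loss completion objective; the paper's row- and column-wise U-statistic pseudo-likelihood is replaced here by this simpler stand-in.

```python
import numpy as np

def svt(M, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def proximal_gradient_complete(Y, mask, lam=1.0, step=1.0, n_iter=200):
    """Complete Y (observed where mask == 1) by minimizing
    0.5 * ||mask * (X - Y)||_F^2 + lam * ||X||_*  (a squared-loss stand-in
    for the paper's pseudo-likelihood loss)."""
    X = np.zeros_like(Y)
    for _ in range(n_iter):
        grad = mask * (X - Y)                  # gradient of the smooth data-fit term
        X = svt(X - step * grad, step * lam)   # singular value thresholding step
    return X

rng = np.random.default_rng(0)
truth = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 40))   # rank-5 ground truth
mask = (rng.random(truth.shape) < 0.4).astype(float)          # 40% entries observed
X_hat = proximal_gradient_complete(mask * truth, mask, lam=2.0)
```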


Operator Learning: A Statistical Perspective

arXiv.org Machine Learning

Operator learning has emerged as a powerful tool in scientific computing for approximating mappings between infinite-dimensional function spaces. A primary application of operator learning is the development of surrogate models for the solution operators of partial differential equations (PDEs). These methods can also be used to develop black-box simulators to model system behavior from experimental data, even without a known mathematical model. In this article, we begin by formalizing operator learning as a function-to-function regression problem and review some recent developments in the field. We also discuss PDE-specific operator learning, outlining strategies for incorporating physical and mathematical constraints into architecture design and training processes. Finally, we highlight key future directions such as active data collection and the development of rigorous uncertainty quantification frameworks.
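
As a minimal, self-contained example of treating operator learning as function-to-function regression, the sketch below learns the solution operator of a 1-D Poisson problem by ridge regression on grid values; this toy linear baseline is our own illustration and not an architecture discussed in the article.

```python
# Learn the map f -> u solving -u'' = f with zero boundary conditions, by
# regressing discretized solutions on discretized forcings.
import numpy as np

n, n_train = 64, 200
h = 1.0 / (n + 1)
# Discrete Dirichlet Laplacian; its inverse is the true solution operator.
L = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / h**2
solve = np.linalg.inv(L)

rng = np.random.default_rng(0)
x = np.linspace(h, 1 - h, n)
# Random smooth forcings: sums of the first five sine modes.
coeffs = rng.normal(size=(n_train, 5))
F = sum(coeffs[:, [k]] * np.sin((k + 1) * np.pi * x) for k in range(5))
U = F @ solve.T                                 # ground-truth solutions u = L^{-1} f

# Ridge-regress the discretized operator: find W with U ~ F W.
lam = 1e-6
W = np.linalg.solve(F.T @ F + lam * np.eye(n), F.T @ U)

f_test = np.sin(3 * np.pi * x) + 0.5 * np.sin(np.pi * x)
u_pred, u_true = f_test @ W, solve @ f_test
print("relative error:", np.linalg.norm(u_pred - u_true) / np.linalg.norm(u_true))
```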


Geometric Median Matching for Robust k-Subset Selection from Noisy Data

arXiv.org Artificial Intelligence

Data pruning, the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. However, existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, we propose Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the geometric median, a robust estimator with an optimal breakdown point of 1/2, to enhance resilience against noisy data. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved O(1/k) convergence rate, a quadratic improvement over random sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings and at high pruning rates, making it a strong baseline for robust data pruning.
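
A hedged sketch of the idea as described in the abstract: estimate the geometric median with Weiszfeld's algorithm, then greedily grow a k-subset whose running mean tracks it. The greedy selection rule below is a plausible reading of that description, not necessarily the authors' exact algorithm.

```python
import numpy as np

def geometric_median(X, n_iter=100, tol=1e-7):
    """Weiszfeld iterations for the geometric median of the rows of X."""
    mu = X.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), 1e-12)  # avoid divide-by-zero
        w = 1.0 / d
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

def gm_matching(X, k):
    """Greedily pick k points so the subset mean approximates the geometric median."""
    gm = geometric_median(X)
    selected, running_sum = [], np.zeros(X.shape[1])
    for t in range(1, k + 1):
        # Choose the point whose inclusion moves the subset mean closest to the GM.
        errors = np.linalg.norm((running_sum + X) / t - gm, axis=1)
        errors[selected] = np.inf            # do not re-select points
        i = int(np.argmin(errors))
        selected.append(i)
        running_sum += X[i]
    return np.array(selected)

rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 16))
outliers = rng.normal(loc=20.0, size=(100, 16))   # heavy corruption
idx = gm_matching(np.vstack([clean, outliers]), k=50)
```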


Graph Attention for Heterogeneous Graphs with Positional Encoding

arXiv.org Machine Learning

Graph Neural Networks (GNNs) have emerged as the de facto standard for modeling graph data, with attention mechanisms and transformers significantly enhancing their performance on graph-based tasks. Despite these advancements, GNNs often struggle on heterogeneous graphs, generally underperforming compared to their homogeneous counterparts. This work benchmarks various GNN architectures to identify the most effective methods for heterogeneous graphs, with a particular focus on node classification and link prediction. Our findings reveal that graph attention networks excel in these tasks. As a main contribution, we explore enhancements to these attention networks by integrating positional encodings for node embeddings. This involves utilizing the full Laplacian spectrum to accurately capture both the relative and absolute positions of each node within the graph, further enhancing performance on downstream tasks such as node classification and link prediction.
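
One common way to realize Laplacian positional encodings of this kind is to append eigenvectors of the normalized graph Laplacian to the node features; the sketch below shows that standard construction on a toy graph and is not necessarily identical to the paper's variant.

```python
import numpy as np

def laplacian_positional_encoding(adj, k=None):
    """Eigenvectors of the symmetric normalized Laplacian as positional features.

    adj : (n, n) symmetric adjacency matrix
    k   : number of eigenvectors to keep (None keeps the full spectrum,
          skipping the trivial constant eigenvector).
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)      # ascending eigenvalues
    return eigvecs[:, 1:] if k is None else eigvecs[:, 1:k + 1]

# Toy graph: concatenate the encodings onto existing node features.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
node_feats = np.random.default_rng(0).normal(size=(4, 8))
x = np.concatenate([node_feats, laplacian_positional_encoding(adj, k=2)], axis=1)
```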


Analytical Discovery of Manifold with Machine Learning

arXiv.org Machine Learning

Yafei Shen, Huan-Fei Ma & Ling Yang (School of Mathematical Sciences, Soochow University, Suzhou 215006, China). Understanding low-dimensional structures within high-dimensional data is crucial for visualization, interpretation, and denoising in complex datasets. Despite advancements in manifold learning techniques, key challenges such as limited global insight and the lack of interpretable analytical descriptions remain unresolved. In this work, we introduce a novel framework, GAMLA (Global Analytical Manifold Learning using Auto-encoding). GAMLA employs a two-round training process within an auto-encoding framework to derive both character and complementary representations of the underlying manifold. With the character representation, the manifold is described by a parametric function that unfolds the manifold to provide a global coordinate. With the complementary representation, an approximate explicit manifold description is developed, offering a global and analytical representation of the smooth manifolds underlying high-dimensional datasets. This enables the analytical derivation of geometric properties such as curvature and normal vectors. Moreover, we find that the two representations together decompose the whole latent space and can thus characterize the local spatial structure surrounding the manifold, proving particularly effective in anomaly detection and categorization. Through extensive experiments on benchmark datasets and real-world applications, GAMLA demonstrates its ability to achieve computational efficiency and interpretability while providing precise geometric and structural insights. This framework bridges the gap between data-driven manifold learning and analytical geometry, presenting a versatile tool for exploring the intrinsic properties of complex datasets. From the introduction: Discovering low-dimensional structures, particularly their geometric properties, from high-dimensional data clouds enables visualization, denoising, and interpretation of complex datasets (Meilă & Zhang, 2023; Belkin & Niyogi, 2003; van der Maaten & Hinton, 2008; McInnes & Healy, 2018; Luo & Hu, 2020). As a result, the concept of manifold learning has attracted significant attention, leading to numerous breakthroughs over the past two decades.
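
To make the two-representation idea concrete, the sketch below trains a toy auto-encoder whose latent space is split into "character" coordinates (intrinsic manifold coordinates) and "complementary" coordinates that are penalized toward zero; this illustrates the concept only, and GAMLA's actual two-round training procedure is more involved.

```python
import torch
import torch.nn as nn

class SplitAutoEncoder(nn.Module):
    def __init__(self, ambient_dim=10, char_dim=2, comp_dim=8):
        super().__init__()
        self.char_dim = char_dim
        latent_dim = char_dim + comp_dim
        self.encoder = nn.Sequential(nn.Linear(ambient_dim, 64), nn.Tanh(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(),
                                     nn.Linear(64, ambient_dim))

    def forward(self, x):
        z = self.encoder(x)
        z_char, z_comp = z[:, :self.char_dim], z[:, self.char_dim:]
        return self.decoder(z), z_char, z_comp

# Sketch of one training round: reconstruct while shrinking the complementary
# coordinates, so the character coordinates carry the manifold's parametrization.
model = SplitAutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 10)                 # stand-in for high-dimensional samples
for _ in range(200):
    recon, z_char, z_comp = model(x)
    loss = ((recon - x) ** 2).mean() + 1e-2 * (z_comp ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```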


Graph Network Modeling Techniques for Visualizing Human Mobility Patterns

arXiv.org Machine Learning

Human mobility analysis at urban scale requires models that represent the complex nature of human movements, which in turn are affected by accessibility to nearby points of interest, the underlying socioeconomic factors of a place, and the local transport choices available to people living in a geographic region. In this work, we represent human mobility and the associated flow of movements as a graph. Graph-based approaches for mobility analysis are still in their early stages of adoption and are actively being researched. The challenges of graph-based mobility analysis are multifaceted: the lack of sufficiently high-quality data to represent flows at high spatial and temporal resolution, the limited computational resources available to translate large volumes of mobility data into a network structure, and the scaling issues inherent in graph models. The current study develops a methodology based on embedding graphs into a continuous space, which alleviates issues related to fast graph matching, graph time-series modeling, and visualization of mobility dynamics. Through experiments, we demonstrate how mobility data collected from taxicab trajectories can be transformed into network structures and patterns of mobility flow changes, and used for downstream tasks, with an approximately 40% decrease in error on average for matched graphs versus unmatched ones.
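
As a small, hedged illustration of the pipeline's first steps, the sketch below aggregates taxi trips into a weighted origin-destination graph and embeds its nodes into a continuous space using Laplacian eigenvectors; the zone IDs, trip list, and embedding choice are placeholders, not the study's data or method.

```python
import numpy as np

trips = [  # (origin_zone, destination_zone) pairs extracted from taxi trajectories
    (0, 1), (0, 1), (1, 2), (2, 0), (2, 3), (3, 2), (1, 3), (0, 2),
]
n_zones = 4

# Weighted adjacency: flow counts between zones (symmetrized for the embedding).
A = np.zeros((n_zones, n_zones))
for o, d in trips:
    A[o, d] += 1
A_sym = A + A.T

# Continuous-space node embedding from the normalized Laplacian's eigenvectors.
deg = A_sym.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
L = np.eye(n_zones) - d_inv_sqrt[:, None] * A_sym * d_inv_sqrt[None, :]
_, vecs = np.linalg.eigh(L)
embedding = vecs[:, 1:3]      # 2-D coordinates per zone; snapshots of these
                              # embeddings can then be compared over time
```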