Goto

Collaborating Authors

 Clustering


Risk Analysis of Flowlines in the Oil and Gas Sector: A GIS and Machine Learning Approach

arXiv.org Artificial Intelligence

This paper presents a risk analysis of flowlines in the oil and gas sector using Geographic Information Systems (GIS) and machine learning (ML). Flowlines, vital conduits transporting oil, gas, and water from wellheads to surface facilities, often face under-assessment compared to transmission pipelines. This study addresses this gap using advanced tools to predict and mitigate failures, improving environmental safety and reducing human exposure. Extensive datasets from the Colorado Energy and Carbon Management Commission (ECMC) were processed through spatial matching, feature engineering, and geometric extraction to build robust predictive models. Various ML algorithms, including logistic regression, support vector machines, gradient boosting decision trees, and K-Means clustering, were used to assess and classify risks, with ensemble classifiers showing superior accuracy, especially when paired with Principal Component Analysis (PCA) for dimensionality reduction. Finally, a thorough data analysis highlighted spatial and operational factors influencing risks, identifying high-risk zones for focused monitoring. Overall, the study demonstrates the transformative potential of integrating GIS and ML in flowline risk management, proposing a data-driven approach that emphasizes the need for accurate data and refined models to improve safety in petroleum extraction.


Fast instance-specific algorithm configuration with graph neural network

arXiv.org Artificial Intelligence

Combinatorial optimization (CO) problems are pivotal across various industrial applications, where the speed of solving these problems is crucial. Improving the performance of CO solvers across diverse input instances requires fine-tuning solver parameters for each instance. However, this tuning process is time-consuming, and the time required increases with the number of instances. To address this, a method called instance-specific algorithm configuration (ISAC) has been devised. This approach involves two main steps: training and execution. During the training step, features are extracted from various instances and then grouped into clusters. For each cluster, parameters are fine-tuned. This cluster-specific tuning process results in a set of generalized parameters for instances belonging to each class. In the execution step, features are extracted from an unknown instance to determine its cluster, and the corresponding pre-tuned parameters are applied. Generally, the running time of a solver is evaluated by the time to solution ($TTS$). However, methods like ISAC require preprocessing. Therefore, the total execution time is $T_{tot}=TTS+T_{tune}$, where $T_{tune}$ represents the tuning time. While the goal is to minimize $T_{tot}$, it is important to note that extracting features in the ISAC method requires a certain amount of computational time. The extracting features include summary statistics of the solver execution logs, which takes several 10 seconds. This research presents a method to significantly reduce the time of the ISAC execution step by streamlining feature extraction and class determination with a graph neural network. Experimental results show that $T_{tune}$ in the execution step, which take several 10 seconds in the original ISAC manner, could be reduced to sub-seconds.


On the Power of SVD in the Stochastic Block Model

Neural Information Processing Systems

A popular heuristic method for improving clustering results is to apply dimensionality reduction before running clustering algorithms.It has been observed that spectral-based dimensionality reduction tools, such as PCA or SVD, improve the performance of clustering algorithms in many applications. This phenomenon indicates that spectral method not only serves as a dimensionality reduction tool, but also contributes to the clustering procedure in some sense. It is an interesting question to understand the behavior of spectral steps in clustering problems.As an initial step in this direction, this paper studies the power of vanilla-SVD algorithm in the stochastic block model (SBM). We show that, in the symmetric setting, vanilla-SVD algorithm recovers all clusters correctly. This result answers an open question posed by Van Vu (Combinatorics Probability and Computing, 2018) in the symmetric setting.


Hierarchical clustering with dot products recovers hidden tree structure

Neural Information Processing Systems

In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.


AiluRus: A Scalable ViT Framework for Dense Prediction

Neural Information Processing Systems

Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their complexity dramatically increases when handling long token sequences, particularly for dense prediction tasks that require high-resolution input. Notably, dense prediction tasks, such as semantic segmentation or object detection, emphasize more on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we select anchors from the token sequence using the proposed spatial-aware density-based clustering algorithm. Tokens that are adjacent to anchors are merged to form low-resolution regions, while others are preserved independently as high-resolution.


Coresets for Vertical Federated Learning: Regularized Linear Regression and K -Means Clustering

Neural Information Processing Systems

Vertical federated learning (VFL), where data features are stored in multiple parties distributively, is an important area in machine learning. However, the communication complexity for VFL is typically very high. In this paper, we propose a unified framework by constructing \emph{coresets} in a distributed fashion for communication-efficient VFL. We study two important learning tasks in the VFL setting: regularized linear regression and k -means clustering, and apply our coreset framework to both problems. We theoretically show that using coresets can drastically alleviate the communication complexity, while nearly maintain the solution quality. Numerical experiments are conducted to corroborate our theoretical findings.


Multi-Swap k-Means++

Neural Information Processing Systems

The k -means algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular k -means clustering objective and is known to give an O(\log k) -approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting k -means with O(k \log \log k) local-search steps obtained through the k -means sampling distribution to yield a c -approximation to the k -means clustering problem, where c is a large absolute constant. Here we generalize and extend their local-search algorithm by considering larger and more sophisticated local-search neighborhoods hence allowing to swap multiple centers at the same time. Our algorithm achieves a 9 \varepsilon approximation ratio, which is the best possible for local search. Importantly we show that our algorithm is practical, namely easy to implement and fast enough to run on a variety of classic datasets, and outputs solutions of better cost.


Dependent nonparametric trees for dynamic hierarchical clustering

Neural Information Processing Systems

Hierarchical clustering methods offer an intuitive and powerful way to model a wide variety of data sets. However, the assumption of a fixed hierarchy is often overly restrictive when working with data generated over a period of time: We expect both the structure of our hierarchy, and the parameters of the clusters, to evolve with time. In this paper, we present a distribution over collections of time-dependent, infinite-dimensional trees that can be used to model evolving hierarchies, and present an efficient and scalable algorithm for performing approximate inference in such a model. We demonstrate the efficacy of our model and inference algorithm on both synthetic data and real-world document corpora.


Hierarchical Agglomerative Graph Clustering in Poly-Logarithmic Depth

Neural Information Processing Systems

Obtaining scalable algorithms for \emph{hierarchical agglomerative clustering} (HAC) is of significant interest due to the massive size of real-world datasets. At the same time, efficiently parallelizing HAC is difficult due to the seemingly sequential nature of the algorithm. In this paper, we address this issue and present ParHAC, the first efficient parallel HAC algorithm with sublinear depth for the widely-used average-linkage function. In particular, we provide a (1 \epsilon) -approximation algorithm for this problem on m edge graphs using \tilde{O}(m) work and poly-logarithmic depth. Moreover, we show that obtaining similar bounds for \emph{exact} average-linkage HAC is not possible under standard complexity-theoretic assumptions.We complement our theoretical results with a comprehensive study of the ParHAC algorithm in terms of its scalability, performance, and quality, and compare with several state-of-the-art sequential and parallel baselines.


Faster DBSCAN via subsampled similarity queries

Neural Information Processing Systems

DBSCAN is a popular density-based clustering algorithm. It computes the \epsilon -neighborhood graph of a dataset and uses the connected components of the high-degree nodes to decide the clusters. However, the full neighborhood graph may be too costly to compute with a worst-case complexity of O(n 2) . In this paper, we propose a simple variant called SNG-DBSCAN, which clusters based on a subsampled \epsilon -neighborhood graph, only requires access to similarity queries for pairs of points and in particular avoids any complex data structures which need the embeddings of the data points themselves. The runtime of the procedure is O(sn 2), where s is the sampling rate. We show under some natural theoretical assumptions that s \approx \log n/n is sufficient for statistical cluster recovery guarantees leading to an O(n\log n) complexity.