 Clustering


Statistical arbitrage in multi-pair trading strategy based on graph clustering algorithms in US equities market

arXiv.org Machine Learning

The study seeks to develop an effective trading strategy within a novel framework of statistical arbitrage based on graph clustering algorithms. An amalgamation of quantitative and machine learning methods, including the Kelly criterion and an ensemble of machine learning classifiers, has been used to improve risk-adjusted returns and increase immunity to transaction costs over existing approaches. The study seeks to provide an integrated approach to optimal signal detection and risk management. As part of this approach, innovative ways of optimizing take-profit and stop-loss functions for daily-frequency trading strategies have been proposed and tested. All of the tested approaches outperformed appropriate benchmarks. The best combinations of the techniques and parameters demonstrated significantly better performance metrics than the relevant benchmarks. The results have been obtained under the assumption of realistic transaction costs, but are sensitive to changes in some key parameters.
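
For readers unfamiliar with the Kelly criterion mentioned in the abstract above, a minimal sketch of its textbook form for a single binary bet follows. This is an illustration only: the abstract does not say which variant the study uses, and the function name `kelly_fraction` and its parameters `p` (win probability) and `b` (net odds, i.e. profit per unit staked on a win) are assumptions made for this example.

```python
def kelly_fraction(p: float, b: float) -> float:
    """Textbook Kelly fraction for a binary bet: f* = p - (1 - p) / b.

    p: probability of winning (assumed known; hypothetical input)
    b: net odds, i.e. profit per unit staked when the bet wins
    A non-positive result means the bet has no positive edge.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if b <= 0.0:
        raise ValueError("b must be positive")
    return p - (1.0 - p) / b


if __name__ == "__main__":
    # A 60% win rate at even odds suggests staking about 20% of capital.
    print(kelly_fraction(0.6, 1.0))
```

In practice the win probability is itself an estimate, which is one reason a fraction of the Kelly stake is often used instead of the full amount.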


In-depth analysis of recall initiators of medical devices with a Machine Learning-Natural language Processing workflow

arXiv.org Artificial Intelligence

Recall initiator identification and assessment are the preliminary steps to prevent medical device recall. Conventional analysis tools are inappropriate for processing massive and multi-formatted data comprehensively and completely to meet the higher expectations of delicacy management with the increasing overall data volume and textual data format. This study presents a big-data-analytics-based machine learning-natural language processing (ML-NLP) work tool to address the shortcomings in processing efficiency and data-handling versatility of conventional tools in the practical context of large data volumes and multiple data formats. Using the ML-NLP tool, this study identified, assessed and analysed medical device recall initiators recorded in the public medical device recall database from 2018 to 2024. The results suggest that the unsupervised Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm can present each single recall initiator in a specific manner, thereby helping practitioners to identify the recall reasons comprehensively and completely within a short time frame. This is then followed by text similarity-based textual classification to assist practitioners in controlling the group size of recall initiators and to provide managerial insights from the operational to the tactical and strategic levels. This ML-NLP work tool can not only capture specific details of each recall initiator but also interpret the inner connection of each existing initiator, and it can be implemented for risk identification and assessment in the forward SC. Finally, this paper offers some concluding remarks and presents future work. More proactive practices and control solutions for medical device recalls are expected in the future.


TACCO: Task-guided Co-clustering of Clinical Concepts and Patient Visits for Disease Subtyping based on EHR Data

arXiv.org Artificial Intelligence

The growing availability of well-organized Electronic Health Records (EHR) data has enabled the development of various machine learning models towards disease risk prediction. However, existing risk prediction methods overlook the heterogeneity of complex diseases, failing to model the potential disease subtypes regarding their corresponding patient visits and clinical concept subgroups. In this work, we introduce TACCO, a novel framework that jointly discovers clusters of clinical concepts and patient visits based on a hypergraph modeling of EHR data. Specifically, we develop a novel self-supervised co-clustering framework that can be guided by the risk prediction task of specific diseases. Furthermore, we enhance the hypergraph model of EHR data with textual embeddings and enforce the alignment between the clusters of clinical concepts and patient visits through a contrastive objective. Comprehensive experiments conducted on the public MIMIC-III dataset and Emory internal CRADLE dataset over the downstream clinical tasks of phenotype classification and cardiovascular risk prediction demonstrate an average 31.25% performance improvement compared to traditional ML baselines and a 5.26% improvement on top of the vanilla hypergraph model without our co-clustering mechanism. In-depth model analysis, clustering results analysis, and clinical case studies further validate the improved utilities and insightful interpretations delivered by TACCO. Code is available at https://github.com/PericlesHat/TACCO.


Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps

arXiv.org Artificial Intelligence

Real-time detection and prediction of extreme weather protect human lives and infrastructure. Traditional methods rely on numerical threshold setting and manual interpretation of weather heatmaps with Geographic Information Systems (GIS), which can be slow and error-prone. Our research redefines Extreme Weather Events Detection (EWED) by framing it as a Visual Question Answering (VQA) problem, thereby introducing a more precise and automated solution. Leveraging Vision-Language Models (VLM) to simultaneously process visual and textual data, we offer an effective aid to enhance the analysis process of weather heatmaps. Our initial assessment of general-purpose VLMs (e.g., GPT-4-Vision) on EWED revealed poor performance, characterized by low accuracy and frequent hallucinations due to inadequate color differentiation and insufficient meteorological knowledge. To address these challenges, we introduce ClimateIQA, the first meteorological VQA dataset, which includes 8,760 wind gust heatmaps and 254,040 question-answer pairs covering four question types, both generated from the latest climate reanalysis data. We also propose Sparse Position and Outline Tracking (SPOT), an innovative technique that leverages OpenCV and K-Means clustering to capture and depict color contours in heatmaps, providing ClimateIQA with more accurate color spatial location information. Finally, we present Climate-Zoo, the first meteorological VLM collection, which adapts VLMs to meteorological applications using the ClimateIQA dataset. Experiment results demonstrate that models from Climate-Zoo substantially outperform state-of-the-art general VLMs, achieving an accuracy increase from 0% to over 90% in EWED verification. The datasets and models in this study are publicly available for future climate science research: https://github.com/AlexJJJChen/Climate-Zoo.


Neural Concept Binder

arXiv.org Artificial Intelligence

The challenge in object-based visual reasoning lies in generating descriptive yet distinct concept representations. Moreover, doing this in an unsupervised fashion requires human users to understand a model's learned concepts and potentially revise false concepts. In addressing this challenge, we introduce the Neural Concept Binder, a new framework for deriving discrete concept representations resulting in what we term "concept-slot encodings". These encodings leverage both "soft binding" via object-centric block-slot encodings and "hard binding" via retrieval-based inference. The Neural Concept Binder facilitates straightforward concept inspection and direct integration of external knowledge, such as human input or insights from other AI models like GPT-4. Additionally, we demonstrate that incorporating the hard binding mechanism does not compromise performance; instead, it enables seamless integration into both neural and symbolic modules for intricate reasoning tasks, as evidenced by evaluations on our newly introduced CLEVR-Sudoku dataset.


Unraveling Anomalies in Time: Unsupervised Discovery and Isolation of Anomalous Behavior in Bio-regenerative Life Support System Telemetry

arXiv.org Artificial Intelligence

Bio-regenerative Life Support Systems (BLSSs) are artificial ecosystems that consist of multiple symbiotic relationships. BLSSs are crucial for sustaining long-duration space missions by facilitating food production and managing essential material cycles for respiratory air, water, biomass, and waste. The EDEN NEXT GEN Project, part of the EDEN roadmap at the German Aerospace Center (DLR), aims to develop a fully integrated ground demonstrator of a BLSS comprising all subsystems, with the ultimate goal of realizing a flight-ready BLSS within the next decade. This initiative builds upon insights from the EDEN ISS project, which investigated controlled environment agriculture (CEA) technologies for space exploration. EDEN ISS, a near-closed-loop research greenhouse deployed in Antarctica from 2017 to 2021, focused on crop production, including lettuces, bell peppers, leafy greens, and various herbs. To ensure the safe and stable operation of BLSSs, we explore methods to mitigate risks regarding system health, particularly regarding food production and nourishment shortages for isolated crews.


Personalized Product Assortment with Real-time 3D Perception and Bayesian Payoff Estimation

arXiv.org Artificial Intelligence

Product assortment selection is a critical challenge facing physical retailers. Effectively aligning inventory with the preferences of shoppers can increase sales and decrease out-of-stocks. However, in real-world settings the problem is challenging due to the combinatorial explosion of product assortment possibilities. Consumer preferences are typically heterogeneous across space and time, making inventory-preference alignment challenging. Additionally, existing strategies rely on syndicated data, which tends to be aggregated and low resolution, and suffers from high latency. To solve these challenges, we introduce a real-time recommendation system, which we call EdgeRec3D. Our system utilizes recent advances in 3D computer vision for perception and automatic, fine-grained sales estimation. These perceptual components run on the edge of the network and facilitate real-time reward signals. Additionally, we develop a Bayesian payoff model to account for noisy estimates from 3D LIDAR data. We rely on spatial clustering to allow the system to adapt to heterogeneous consumer preferences, and a graph-based candidate generation algorithm to address the combinatorial search problem. We test our system in real-world stores across two 6-8 week A/B tests with beverage products and demonstrate a 35% and 27% increase in sales, respectively. Finally, we monitor the deployed system for a period of 28 weeks with an observational study and show a 9.4% increase in sales.


Dynamic Correlation Clustering in Sublinear Update Time

arXiv.org Artificial Intelligence

Clustering is a cornerstone of contemporary machine learning and data analysis. A successful clustering algorithm partitions data elements so that similar items reside within the same group, while dissimilar items are separated. Introduced in 2004 by Bansal, Blum and Chawla (Bansal et al., 2004), the correlation clustering objective offers a natural approach to model this problem. Due to its concise and elegant formulation, this problem has drawn significant interest from researchers and practitioners, leading to applications across diverse domains. These include ensemble clustering identification (Bonchi et al., 2013), duplicate detection (Arasu et al., 2009), community mining (Chen et al., 2012), disambiguation tasks (Kalashnikov et al., 2008), automated labeling (Agrawal et al., 2009; Chakrabarti et al., 2008), and many more. In the correlation clustering problem we are given a graph where each edge has either a positive or a negative label: a positive edge (u, v) indicates that u and v are similar elements, and a negative edge (u, v) indicates that u and v are dissimilar. The objective is to compute a partition of the graph that minimizes the number of negative edges within clusters plus the number of positive edges between clusters. Since the problem is NP-hard, researchers have focused on designing approximation algorithms. The algorithm proposed by Cao et al. (2024) achieves an approximation ratio of 1.43 + ϵ, improving upon the previous 1.73 + ϵ and 1.994 + ϵ achieved by Cohen-Addad et al. (2023, 2022b). Prior to these developments, the best approximation guarantee of 2.06 was attained by the algorithm of Chawla et al. (2015).
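
The correlation clustering objective described in the abstract above can be made concrete with a small sketch. The code only evaluates the cost of a given partition (negative edges inside clusters plus positive edges across clusters); it does not implement any of the approximation algorithms cited. The function name `disagreement_cost`, the edge encoding (+1 for positive, -1 for negative), and the toy graph are assumptions made for this example.

```python
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def disagreement_cost(labels: Dict[Hashable, int], signs: Dict[Edge, int]) -> int:
    """Count the disagreements of a partition with the edge labels.

    labels: maps each node to a cluster id (the partition being scored)
    signs:  maps each edge (u, v) to +1 (similar) or -1 (dissimilar)
    """
    cost = 0
    for (u, v), sign in signs.items():
        same_cluster = labels[u] == labels[v]
        if sign > 0 and not same_cluster:
            cost += 1  # positive edge cut between clusters
        elif sign < 0 and same_cluster:
            cost += 1  # negative edge kept inside a cluster
    return cost


if __name__ == "__main__":
    # Hypothetical toy graph on four nodes with a label on every pair.
    signs = {
        ("a", "b"): +1, ("a", "c"): +1, ("b", "c"): -1,
        ("c", "d"): +1, ("a", "d"): -1, ("b", "d"): -1,
    }
    two_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
    one_cluster = {"a": 0, "b": 0, "c": 0, "d": 0}
    print(disagreement_cost(two_clusters, signs))  # 1: only (a, c) is cut
    print(disagreement_cost(one_cluster, signs))   # 3: all negative edges inside
```

On this toy graph no partition reaches cost 0, because the triangle a, b, c has two positive edges and one negative edge, so at least one of them must be violated.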


Federated Incomplete Multi-View Clustering with Heterogeneous Graph Neural Networks

arXiv.org Artificial Intelligence

Federated multi-view clustering offers the potential to develop a global clustering model using data distributed across multiple devices. However, current methods face challenges due to the absence of label information and the paramount importance of data privacy. A significant issue is the feature heterogeneity across multi-view data, which complicates the effective mining of complementary clustering information. Additionally, the inherent incompleteness of multi-view data in a distributed setting can further complicate the clustering process. To address these challenges, we introduce a federated incomplete multi-view clustering framework with heterogeneous graph neural networks (FIM-GNNs). In the proposed FIM-GNNs, autoencoders built on heterogeneous graph neural network models are employed for feature extraction of multi-view data at each client site. At the server level, heterogeneous features from overlapping samples of each client are aggregated into a global feature representation. Global pseudo-labels are generated at the server to enhance the handling of incomplete view data, where these labels serve as a guide for integrating and refining the clustering process across different data views. Comprehensive experiments have been conducted on public benchmark datasets to verify the performance of the proposed FIM-GNNs in comparison with state-of-the-art algorithms.


Bias-Corrected Joint Spectral Embedding for Multilayer Networks with Invariant Subspace: Entrywise Eigenvector Perturbation and Inference

arXiv.org Machine Learning

In this paper, we propose to estimate the invariant subspace across heterogeneous multiple networks using a novel bias-corrected joint spectral embedding algorithm. The proposed algorithm recursively calibrates the diagonal bias of the sum of squared network adjacency matrices by leveraging the closed-form bias formula and iteratively updates the subspace estimator using the most recent estimated bias. Correspondingly, we establish a complete recipe for the entrywise subspace estimation theory for the proposed algorithm, including a sharp entrywise subspace perturbation bound and the entrywise eigenvector central limit theorem. Leveraging these results, we settle two multiple network inference problems: the exact community detection in multilayer stochastic block models and the hypothesis testing of the equality of membership profiles in multilayer mixed membership models. Our proof relies on delicate leave-one-out and leave-two-out analyses that are specifically tailored to block-wise symmetric random matrices and a martingale argument that is of fundamental interest for the entrywise eigenvector central limit theorem.