Clustering
Urinary Tract Infection Detection in Digital Remote Monitoring: Strategies for Managing Participant-Specific Prediction Complexity
Fan, Kexin, Capstick, Alexander, Nilforooshan, Ramin, Barnaghi, Payam
Urinary tract infections (UTIs) are a significant health concern, particularly for people living with dementia (PLWD), as they can lead to severe complications if not detected and treated early. This study builds on previous work that utilised machine learning (ML) to detect UTIs in PLWD by analysing in-home activity and physiological data collected through low-cost, passive sensors. The current research focuses on improving the performance of previous models, particularly by refining the Multilayer Perceptron (MLP), to better handle variations in home environments and improve sex fairness in predictions by making use of concepts from multitask learning. This study implemented three primary model designs: feature clustering, loss-dependent clustering, and participant ID embedding which were compared against a baseline MLP model. The results demonstrated that the loss-dependent MLP achieved the most significant improvements, increasing validation precision from 48.92% to 72.60% and sensitivity from 27.44% to 70.52%, while also enhancing model fairness across sexes. These findings suggest that the refined models offer a more reliable and equitable approach to early UTI detection in PLWD, addressing participant-specific data variations and enabling clinicians to detect and screen for UTI risks more effectively, thereby facilitating earlier and more accurate treatment decisions.
Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework
Rahal, Manal, Ahmed, Bestoun S., Szabados, Gergely, Fornstedt, Torgny, Samuelsson, Jorgen
Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore, tedious and time-consuming work goes into data preparation and improvement before moving further in the ML pipeline. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines the curation of quality measurements and unsupervised learning to distinguish high- and low-quality data. The framework is designed to integrate flexible and general-purpose methods so that it is deployed in various domains and applications. To validate the outcomes of the designed framework, we implemented it in a real-world use case from the field of analytical chemistry, where it is tested on three datasets of anti-sense oligonucleotides. A domain expert is consulted to identify the relevant quality measurements and evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data that guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.
Cross-Domain Continual Learning for Edge Intelligence in Wireless ISAC Networks
Hu, Jingzhi, Li, Xin, Su, Zhou, Luo, Jun
In wireless networks with integrated sensing and communications (ISAC), edge intelligence (EI) is expected to be developed at edge devices (ED) for sensing user activities based on channel state information (CSI). However, due to the CSI being highly specific to users' characteristics, the CSI-activity relationship is notoriously domain dependent, essentially demanding EI to learn sufficient datasets from various domains in order to gain cross-domain sensing capability. This poses a crucial challenge owing to the EDs' limited resources, for which storing datasets across all domains will be a significant burden. In this paper, we propose the EdgeCL framework, enabling the EI to continually learn-then-discard each incoming dataset, while remaining resilient to catastrophic forgetting. We design a transformer-based discriminator for handling sequences of noisy and nonequispaced CSI samples. Besides, we propose a distilled core-set based knowledge retention method with robustness-enhanced optimization to train the discriminator, preserving its performance for previous domains while preventing future forgetting. Experimental evaluations show that EdgeCL achieves 89% of performance compared to cumulative training while consuming only 3% of its memory, mitigating forgetting by 79%.
Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options
Nair, Lakshmi, Trase, Ian, Kim, Mark
We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic system for autonomously solving Machine Learning tasks (AutoML). Our framework outperforms state-of-the-art baselines, achieving improvements of 38.2% - 69.2% on standard data science tasks, and 37.4% - 47.9% on therapeutic chemistry tasks. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Beyond classification and regression, we illustrate the broader applicability of our FoO-based agentic system to tasks such as reinforcement learning and image generation. Our framework presents significant advancements compared to current state-of-the-art agentic systems for AutoML, due to the benefits of FoO in enforcing diversity in LLM solutions through compressed, explainable representations that also support long-term memory when combined with case-based reasoning.
Approximate Tree Completion and Learning-Augmented Algorithms for Metric Minimum Spanning Trees
Veldt, Nate, Stanley, Thomas, Priest, Benjamin W., Steil, Trevor, Iwabuchi, Keita, Jayram, T. S., Sanders, Geoffrey
Finding a minimum spanning tree (MST) for $n$ points in an arbitrary metric space is a fundamental primitive for hierarchical clustering and many other ML tasks, but this takes $\Omega(n^2)$ time to even approximate. We introduce a framework for metric MSTs that first (1) finds a forest of disconnected components using practical heuristics, and then (2) finds a small weight set of edges to connect disjoint components of the forest into a spanning tree. We prove that optimally solving the second step still takes $\Omega(n^2)$ time, but we provide a subquadratic 2.62-approximation algorithm. In the spirit of learning-augmented algorithms, we then show that if the forest found in step (1) overlaps with an optimal MST, we can approximate the original MST problem in subquadratic time, where the approximation factor depends on a measure of overlap. In practice, we find nearly optimal spanning trees for a wide range of metrics, while being orders of magnitude faster than exact algorithms.
$k$-Graph: A Graph Embedding for Interpretable Time Series Clustering
Boniol, Paul, Tiano, Donato, Bonifati, Angela, Palpanas, Themis
Time series clustering poses a significant challenge with diverse applications across domains. A prominent drawback of existing solutions lies in their limited interpretability, often confined to presenting users with centroids. In addressing this gap, our work presents $k$-Graph, an unsupervised method explicitly crafted to augment interpretability in time series clustering. Leveraging a graph representation of time series subsequences, $k$-Graph constructs multiple graph representations based on different subsequence lengths. This feature accommodates variable-length time series without requiring users to predetermine subsequence lengths. Our experimental results reveal that $k$-Graph outperforms current state-of-the-art time series clustering algorithms in accuracy, while providing users with meaningful explanations and interpretations of the clustering outcomes.
Federated Variational Inference for Bayesian Mixture Models
Rao, Jackie, Crowe, Francesca L., Marshall, Tom, Richardson, Sylvia, Kirk, Paul D. W.
We present a federated learning approach for Bayesian model-based clustering of large-scale binary and categorical datasets. We introduce a principled 'divide and conquer' inference procedure using variational inference with local merge and delete moves within batches of the data in parallel, followed by 'global' merge moves across batches to find global clustering structures. We show that these merge moves require only summaries of the data in each batch, enabling federated learning across local nodes without requiring the full dataset to be shared. Empirical results on simulated and benchmark datasets demonstrate that our method performs well in comparison to existing clustering algorithms. We validate the practical utility of the method by applying it to large scale electronic health record (EHR) data.
Feature Engineering Approach to Building Load Prediction: A Case Study for Commercial Building Chiller Plant Optimization in Tropical Weather
Wang, Zhan, Weidong, Chen, Zhifeng, Huang, Islam, Md Raisul, Jon, Chua Kian
In tropical countries with high humidity, air conditioning can account for up to 60% of a building's energy use. For commercial buildings with centralized systems, the efficiency of the chiller plant is vital, and model predictive control provides an effective strategy for optimizing operations through dynamic adjustments based on accurate load predictions. Artificial neural networks are effective for modelling nonlinear systems but are prone to overfitting due to their complexity. Effective feature engineering can mitigate this issue. While weather data are crucial for load prediction, they are often used as raw numerical inputs without advanced processing. Clustering features is a technique that can reduce model complexity and enhance prediction accuracy. Although previous studies have explored clustering algorithms for load prediction, none have applied them to multidimensional weather data, revealing a research gap. This study presents a cooling load prediction model that combines a neural network with Kalman filtering and K-means clustering. Applied to real world data from a commercial skyscraper in Singapore's central business district, the model achieved a 46.5% improvement in prediction accuracy. An optimal chiller sequencing strategy was also developed through genetic algorithm optimization of the predictive load, potentially saving 13.8% in energy. Finally, the study evaluated the integration of thermal energy storage into the chiller plant design, demonstrating potential reductions in capital and operational costs of 26% and 13%, respectively.
zScore: A Universal Decentralised Reputation System for the Blockchain Economy
Udupi, Himanshu, Sahoo, Ashutosh, P., Akshay S., S., Gurukiran, Paul, Parag, Martens, Petrus C.
Modern society functions on trust. The onchain economy, however, is built on the founding principles of trustless peer-to-peer interactions in an adversarial environment without a centralised body of trust and needs a verifiable system to quantify credibility to minimise bad economic activity. We provide a robust framework titled zScore, a core primitive for reputation derived from a wallet's onchain behaviour using state-of-the-art AI neural network models combined with real-world credentials ported onchain through zkTLS. The initial results tested on retroactive data from lending protocols establish a strong correlation between a good zScore and healthy borrowing and repayment behaviour, making it a robust and decentralised alibi for creditworthiness; we highlight significant improvements from previous attempts by protocols like Cred showcasing its robustness. We also present a list of possible applications of our system in Section 5, thereby establishing its utility in rewarding actual value creation while filtering noise and suspicious activity and flagging malicious behaviour by bad actors.
Web Phishing Net (WPN): A scalable machine learning approach for real-time phishing campaign detection
Zia, Muhammad Fahad, Kalidass, Sri Harish
--Phishing is the most prevalent type of cyber-attack today and is recognized as the leading source of data breaches with significant consequences for both individuals and corporations. Web-based phishing attacks are the most frequent with vectors such as social media posts and emails containing links to phishing URLs that once clicked on render host systems vulnerable to more sinister attacks. Research efforts to detect phishing URLs have involved the use of supervised learning techniques that use large amounts of data to train models and have high computational requirements. They also involve analysis of features derived from vectors including email contents thus affecting user privacy. Additionally, they suffer from a lack of resilience against evolution of threats especially with the advent of generative AI techniques to bypass these systems as with AI-generated phishing URLs. Unsupervised methods such as clustering techniques have also been used in phishing detection in the past, however, they are at times unscalable due to the use of pair-wise comparisons. They also lack high detection rates while detecting phishing campaigns. In this paper, we propose an unsupervised learning approach that is not only fast but scalable, as it does not involve pair-wise comparisons. It is able to detect entire campaigns at a time with a high detection rate while preserving user privacy; this includes the recent surge of campaigns with targeted phishing URLs generated by malicious entities using generative AI techniques.