Goto

Collaborating Authors

 Clustering


Maximum Likelihood Estimation on Stochastic Blockmodels for Directed Graph Clustering

arXiv.org Machine Learning

This paper studies the directed graph clustering problem through the lens of statistics, where we formulate clustering as estimating underlying communities in the directed stochastic block model (DSBM). We conduct the maximum likelihood estimation (MLE) on the DSBM and thereby ascertain the most probable community assignment given the observed graph structure. In addition to the statistical point of view, we further establish the equivalence between this MLE formulation and a novel flow optimization heuristic, which jointly considers two important directed graph statistics: edge density and edge orientation. Building on this new formulation of directed clustering, we introduce two efficient and interpretable directed clustering algorithms, a spectral clustering algorithm and a semidefinite programming based clustering algorithm. We provide a theoretical upper bound on the number of misclustered vertices of the spectral clustering algorithm using tools from matrix perturbation theory. We compare, both quantitatively and qualitatively, our proposed algorithms with existing directed clustering methods on both synthetic and real-world data, thus providing further ground to our theoretical contributions. Keywords: graph clustering, directed graphs, maximum likelihood estimation, spectral methods, matrix perturbation analysis, semidefinite programming. Authors are listed in alphabetical order. This is the corresponding author.


Integrating behavior analysis with machine learning to predict online learning performance: A scientometric review and empirical study

arXiv.org Artificial Intelligence

The interest in predicting online learning performance using ML algorithms has been steadily increasing. We first conducted a scientometric analysis to provide a systematic review of research in this area. The findings show that most existing studies apply the ML methods without considering learning behavior patterns, which may compromise the prediction accuracy and precision of the ML methods. This study proposes an integration framework that blends learning behavior analysis with ML algorithms to enhance the prediction accuracy of students' online learning performance. Specifically, the framework identifies distinct learning patterns among students by employing clustering analysis and implements various ML algorithms to predict performance within each pattern. For demonstration, the integration framework is applied to a real dataset from edX and distinguishes two learning patterns, as in, low autonomy students and motivated students. The results show that the framework yields nearly perfect prediction performance for autonomous students and satisfactory performance for motivated students. Additionally, this study compares the prediction performance of the integration framework to that of directly applying ML methods without learning behavior analysis using comprehensive evaluation metrics. The results consistently demonstrate the superiority of the integration framework over the direct approach, particularly when integrated with the best-performing XGBoosting method. Moreover, the framework significantly improves prediction accuracy for the motivated students and for the worst-performing random forest method. This study also evaluates the importance of various learning behaviors within each pattern using LightGBM with SHAP values. The implications of the integration framework and the results for online education practice and future research are discussed.


A Quantum Fuzzy-based Approach for Real-Time Detection of Solar Coronal Holes

arXiv.org Artificial Intelligence

The detection and analysis of the solar coronal holes (CHs) is an important field of study in the domain of solar physics. Mainly, it is required for the proper prediction of the geomagnetic storms which directly or indirectly affect various space and ground-based systems. For the detection of CHs till date, the solar scientist depends on manual hand-drawn approaches. However, with the advancement of image processing technologies, some automated image segmentation methods have been used for the detection of CHs. In-spite of this, fast and accurate detection of CHs are till a major issues. Here in this work, a novel quantum computing-based fast fuzzy c-mean technique has been developed for fast detection of the CHs region. The task has been carried out in two stages, in first stage the solar image has been segmented using a quantum computing based fast fuzzy c-mean (QCFFCM) and in the later stage the CHs has been extracted out from the segmented image based on image morphological operation. In the work, quantum computing has been used to optimize the cost function of the fast fuzzy c-mean (FFCM) algorithm, where quantum approximate optimization algorithm (QAOA) has been used to optimize the quadratic part of the cost function. The proposed method has been tested for 193 \AA{} SDO/AIA full-disk solar image datasets and has been compared with the existing techniques. The outcome shows the comparable performance of the proposed method with the existing one within a very lesser time.


A Novel Behavior-Based Recommendation System for E-commerce

arXiv.org Artificial Intelligence

The majority of existing recommender systems rely on user ratings, which are limited by the lack of user collaboration and the sparsity problem. To address these issues, this study proposes a behavior-based recommender system that leverages customers' natural behaviors, such as browsing and clicking, on e-commerce platforms. The proposed recommendation system involves clustering active customers, determining neighborhoods, collecting similar users, calculating product reputation based on similar users, and recommending high-reputation products. To overcome the complexity of customer behaviors and traditional clustering methods, an unsupervised clustering approach based on product categories is developed to enhance the recommendation methodology. This study makes notable contributions in several aspects. Firstly, a groundbreaking behavior-based recommendation methodology is developed, incorporating customer behavior to generate accurate and tailored recommendations leading to improved customer satisfaction and engagement. Secondly, an original unsupervised clustering method, focusing on product categories, enables more precise clustering and facilitates accurate recommendations. Finally, an approach to determine neighborhoods for active customers within clusters is established, ensuring grouping of customers with similar behavioral patterns to enhance recommendation accuracy and relevance. The proposed recommendation methodology and clustering method contribute to improved recommendation performance, offering valuable insights for researchers and practitioners in the field of e-commerce recommendation systems. Additionally, the proposed method outperforms benchmark methods in experiments conducted using a behavior dataset from the well-known e-commerce site Alibaba.


Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means

arXiv.org Artificial Intelligence

This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to create a scalable variant designed for big data applications. It addresses scalability and computation time challenges typically faced with traditional techniques. The algorithm adjusts sample sizes dynamically for each worker during execution, optimizing performance. Data from these sample sizes are continually analyzed, facilitating the identification of the most efficient configuration. By incorporating a competitive element among workers using different sample sizes, efficiency within the Big-means algorithm is further stimulated. In essence, the algorithm balances computational time and clustering quality by employing a stochastic, competitive sampling strategy in a parallel computing setting.


A Personalized Video-Based Hand Taxonomy: Application for Individuals with Spinal Cord Injury

arXiv.org Artificial Intelligence

Hand function is critical for our interactions and quality of life. Spinal cord injuries (SCI) can impair hand function, reducing independence. A comprehensive evaluation of function in home and community settings requires a hand grasp taxonomy for individuals with impaired hand function. Developing such a taxonomy is challenging due to unrepresented grasp types in standard taxonomies, uneven data distribution across injury levels, and limited data. This study aims to automatically identify the dominant distinct hand grasps in egocentric video using semantic clustering. Egocentric video recordings collected in the homes of 19 individual with cervical SCI were used to cluster grasping actions with semantic significance. A deep learning model integrating posture and appearance data was employed to create a personalized hand taxonomy. Quantitative analysis reveals a cluster purity of 67.6% +- 24.2% with with 18.0% +- 21.8% redundancy. Qualitative assessment revealed meaningful clusters in video content. This methodology provides a flexible and effective strategy to analyze hand function in the wild. It offers researchers and clinicians an efficient tool for evaluating hand function, aiding sensitive assessments and tailored intervention plans.


Online Tree Reconstruction and Forest Inventory on a Mobile Robotic System

arXiv.org Artificial Intelligence

Terrestrial laser scanning (TLS) is the standard technique used to create accurate point clouds for digital forest inventories. However, the measurement process is demanding, requiring up to two days per hectare for data collection, significant data storage, as well as resource-heavy post-processing of 3D data. In this work, we present a real-time mapping and analysis system that enables online generation of forest inventories using mobile laser scanners that can be mounted e.g. on mobile robots. Given incrementally created and locally accurate submaps-data payloads-our approach extracts tree candidates using a custom, Voronoi-inspired clustering algorithm. Tree candidates are reconstructed using an adapted Hough algorithm, which enables robust modeling of the tree stem. Further, we explicitly incorporate the incremental nature of the data collection by consistently updating the database using a pose graph LiDAR SLAM system. This enables us to refine our estimates of the tree traits if an area is revisited later during a mission. We demonstrate competitive accuracy to TLS or manual measurements using laser scanners that we mounted on backpacks or mobile robots operating in conifer, broad-leaf and mixed forests. Our results achieve RMSE of 1.93 cm, a bias of 0.65 cm and a standard deviation of 1.81 cm (averaged across these sequences)-with no post-processing required after the mission is complete.


Topic Detection and Tracking with Time-Aware Document Embeddings

arXiv.org Artificial Intelligence

The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account, but does not well capture how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it can benefit the overall performance. We conduct ablation studies on the time representation and fusion algorithm strategies, showing that our proposed model outperforms alternative strategies. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.


Towards Explainable Clustering: A Constrained Declarative based Approach

arXiv.org Artificial Intelligence

The domain of explainable AI is of interest in all Machine Learning fields, and it is all the more important in clustering, an unsupervised task whose result must be validated by a domain expert. We aim at finding a clustering that has high quality in terms of classic clustering criteria and that is explainable, and we argue that these two dimensions must be considered when building the clustering. We consider that a good global explanation of a clustering should give the characteristics of each cluster taking into account their abilities to describe its objects (coverage) while distinguishing it from the other clusters (discrimination). Furthermore, we aim at leveraging expert knowledge, at different levels, on the structure of the expected clustering or on its explanations. In our framework an explanation of a cluster is a set of patterns, and we propose a novel interpretable constrained clustering method called ECS for declarative clustering with Explainabilty-driven Cluster Selection that integrates structural or domain expert knowledge expressed by means of constraints. It is based on the notion of coverage and discrimination that are formalized at different levels (cluster / clustering), each allowing for exceptions through parameterized thresholds. Our method relies on four steps: generation of a set of partitions, computation of frequent patterns for each cluster, pruning clusters that violates some constraints, and selection of clusters and associated patterns to build an interpretable clustering. This last step is combinatorial and we have developed a Constraint-Programming (CP) model to solve it. The method can integrate prior knowledge in the form of user constraints, both before or in the CP model.


EndToEndML: An Open-Source End-to-End Pipeline for Machine Learning Applications

arXiv.org Artificial Intelligence

Artificial intelligence (AI) techniques are widely applied in the life sciences. However, applying innovative AI techniques to understand and deconvolute biological complexity is hindered by the learning curve for life science scientists to understand and use computing languages. An open-source, user-friendly interface for AI models, that does not require programming skills to analyze complex biological data will be extremely valuable to the bioinformatics community. With easy access to different sequencing technologies and increased interest in different 'omics' studies, the number of biological datasets being generated has increased and analyzing these high-throughput datasets is computationally demanding. The majority of AI libraries today require advanced programming skills as well as machine learning, data preprocessing, and visualization skills. In this research, we propose a web-based end-to-end pipeline that is capable of preprocessing, training, evaluating, and visualizing machine learning (ML) models without manual intervention or coding expertise. By integrating traditional machine learning and deep neural network models with visualizations, our library assists in recognizing, classifying, clustering, and predicting a wide range of multi-modal, multi-sensor datasets, including images, languages, and one-dimensional numerical data, for drug discovery, pathogen classification, and medical diagnostics.