Interpretable Data Mining of Follicular Thyroid Cancer Ultrasound Features Using Enhanced Association Rules

Zhou, Songlin, Zhou, Tao, Li, Xin, Yau, Stephen Shing-Toung

arXiv.org Artificial Intelligence

Purpose: Thyroid cancer is a common malignancy, and papillary thyroid cancer and follicular thyroid cancer are its two most common types. Follicular thyroid cancer lacks distinctive ultrasound signs and is more difficult to diagnose preoperatively than the more prevalent papillary thyroid cancer, and the clinical studies associated with it are less well established. We aimed to analyze clinical data on follicular thyroid cancer with a novel data mining tool to identify clinical indications that may help in preoperative diagnosis. Methods: We performed a retrospective analysis based on case data collected by the Department of General Surgery of Peking University Third Hospital between 2010 and 2023. Unlike traditional statistical methods, we improved association rule mining, a classical data mining method, and proposed new analytical metrics reflecting the malignant association between clinical indications and cancer, drawing on the idea of the SHAP method from interpretable machine learning. Results: After preprocessing, the dataset contained 1673 cases (counted as nodules rather than patients), of which 1414 were benign and 259 were malignant nodules. Our analysis indicated that, in addition to some common indicators (e.g., irregular or lobulated nodule margins, a halo of uneven thickness, hypoechogenicity), some indicators showed strong malignant associations, such as a nodule-in-nodule pattern, a trabecular pattern, and low TSH scores. In addition, our results suggest that coexisting Hashimoto's thyroiditis may also have a strong malignant association. Conclusion: In the preoperative diagnosis of nodules suspected of follicular thyroid cancer, multiple clinical indications should be considered for a more accurate diagnosis. The diverse malignant associations identified in our study may serve as a reference for clinicians in related fields.


Mining Voter Behaviour and Confidence: A Rule-Based Analysis of the 2022 U.S. Elections

Jubair, Md Al, Arefin, Mohammad Shamsul, Reza, Ahmed Wasif

arXiv.org Artificial Intelligence

This study explores the relationship between voter trust and their experiences during elections by applying a rule-based data mining technique to the 2022 Survey of the Performance of American Elections (SPAE). Using the Apriori algorithm and setting parameters to capture meaningful associations (support >= 3%, confidence >= 60%, and lift > 1.5), the analysis revealed a strong connection between demographic attributes and voting-related challenges, such as registration hurdles, accessibility issues, and queue times. For instance, respondents who indicated that accessing polling stations was "very easy" and who reported moderate confidence were found to be over six times more likely (lift = 6.12) to trust their county's election outcome and experience no registration issues. A further analysis, which adjusted the support threshold to 2%, specifically examined patterns among minority voters. It revealed that 98.16 percent of Black voters who reported easy access to polling locations also had smooth registration experiences. Additionally, those who had high confidence in the vote-counting process were almost two times as likely to identify as Democratic Party supporters. These findings point to the important role that enhancing voting access and offering targeted support can play in building trust in the electoral system, particularly among marginalized communities.
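The thresholds quoted above (support, confidence, lift) are the standard association-rule metrics and can be computed directly. A minimal plain-Python sketch, using hypothetical survey-style attribute tokens as stand-ins for SPAE variables:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence, and lift for a single
    association rule antecedent -> consequent over a list of
    transactions (frozensets of attribute tokens)."""
    n = len(transactions)
    a, c = frozenset(antecedent), frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_c = sum(1 for t in transactions if c <= t)
    n_ac = sum(1 for t in transactions if (a | c) <= t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Toy survey-style transactions (attribute names are illustrative)
data = [frozenset(t) for t in [
    {"easy_access", "no_reg_issue", "trusts_count"},
    {"easy_access", "no_reg_issue", "trusts_count"},
    {"easy_access", "trusts_count"},
    {"hard_access", "reg_issue"},
    {"easy_access", "no_reg_issue"},
]]
s, c, l = rule_metrics(data, {"easy_access"}, {"trusts_count"})
```

An Apriori implementation would enumerate candidate itemsets level by level and keep only rules clearing the support, confidence, and lift cutoffs computed as above.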


Hashing for Fast Pattern Set Selection

Karjalainen, Maiju, Miettinen, Pauli

arXiv.org Artificial Intelligence

Pattern set mining, which is the task of finding a good set of patterns instead of all patterns, is a fundamental problem in data mining. Many different definitions of what constitutes a good set have been proposed in recent years. In this paper, we consider the reconstruction error as a proxy measure for the goodness of the set, and concentrate on the adjacent problem of how to find a good set efficiently. We propose a method based on bottom-k hashing for efficiently selecting the set and extend the method for the common case where the patterns might only appear in approximate form in the data. Our approach has applications in tiling databases, Boolean matrix factorization, and redescription mining, among others. We show that our hashing-based approach is significantly faster than the standard greedy algorithm while obtaining almost equally good results in both synthetic and real-world data sets.
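As a rough illustration of the bottom-k idea (a generic sketch, not the authors' selection algorithm), a bottom-k signature keeps the k smallest hash values of a set, which supports cheap similarity estimates between pattern extents:

```python
import hashlib

def bottom_k(items, k=4):
    """Bottom-k signature: the k smallest hash values of a set."""
    hs = sorted(int(hashlib.md5(str(x).encode()).hexdigest(), 16)
                for x in set(items))
    return hs[:k]

def jaccard_estimate(sig_a, sig_b, k=4):
    """Estimate the Jaccard similarity of the underlying sets from
    their bottom-k signatures: the fraction of the k smallest union
    hashes that occur in both signatures."""
    union = sorted(set(sig_a) | set(sig_b))[:k]
    inter = set(sig_a) & set(sig_b)
    return sum(1 for h in union if h in inter) / len(union)
```

Because signatures are tiny and mergeable, coverage-style comparisons between many candidate patterns become cheap, which is the efficiency lever a hashing-based selector exploits.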


Alice and the Caterpillar: A more descriptive null model for assessing data mining results

Preti, Giulia, Morales, Gianmarco De Francisci, Riondato, Matteo

arXiv.org Artificial Intelligence

We introduce novel null models for assessing the results obtained from observed binary transactional and sequence datasets, using statistical hypothesis testing. Our null models maintain more properties of the observed dataset than existing ones. Specifically, they preserve the Bipartite Joint Degree Matrix of the bipartite (multi-)graph corresponding to the dataset, which ensures that the number of caterpillars, i.e., paths of length three, is preserved, in addition to other properties considered by other models. We describe Alice, a suite of Markov chain Monte Carlo algorithms for sampling datasets from our null models, based on a carefully defined set of states and efficient operations to move between them. The results of our experimental evaluation show that Alice mixes fast and scales well, and that our null model finds different significant results than ones previously considered in the literature.
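The caterpillar (length-three path) count that the null model preserves follows from the degree sequence around each edge: every path u1-v1-u2-v2 is determined by its central edge plus one extra neighbor on each side. A minimal sketch on a bipartite edge list:

```python
from collections import defaultdict

def caterpillar_count(edges):
    """Count paths of length three (caterpillars) in a bipartite
    graph given as (left, right) edge pairs. Each path u1-v1-u2-v2
    is counted exactly once via its central edge, giving
    sum over edges (u, v) of (deg(u) - 1) * (deg(v) - 1)."""
    deg_left, deg_right = defaultdict(int), defaultdict(int)
    for u, v in edges:
        deg_left[u] += 1
        deg_right[v] += 1
    return sum((deg_left[u] - 1) * (deg_right[v] - 1) for u, v in edges)
```

Since the formula depends only on the degrees at each edge's endpoints, any sampler that preserves the Bipartite Joint Degree Matrix automatically preserves this count, as the abstract states.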


Identifying and Characterising Higher Order Interactions in Mobility Networks Using Hypergraphs

Sambaturu, Prathyush, Gutierrez, Bernardo, Kraemer, Moritz U. G.

arXiv.org Artificial Intelligence

Human mobility data is crucial for understanding patterns of movement across geographical regions, with applications spanning urban planning [1], transportation systems design [2], infectious disease modeling and control [3, 4], and social dynamics studies [5]. Traditionally, mobility data has been represented using flow networks [6, 7] or colocation matrices [8], where the primary representation is via pairwise interactions. In flow networks, directed edges represent the movement of individuals between two locations; colocation matrices measure the probability that a random individual from one region is colocated with a random individual from another region at the same location. These data types and their pairwise representation structure have been used to identify the spatial scales and regularity of human mobility, but they have inherent limitations in their capacity to capture more complex patterns of human movement involving higher-order interactions between locations, that is, groups of locations that are frequently visited by many individuals within a period of time (e.g., a week) and revisited regularly over time. Higher-order interactions between locations can contain crucial information under certain scenarios.
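One way to move beyond pairwise flows, in the spirit described above, is to treat the set of locations an individual visits within a time window as a hyperedge and count how often each group recurs. A toy sketch (the location names and window choice are illustrative, not from the paper):

```python
from collections import Counter
from itertools import combinations

def higher_order_interactions(trajectories, min_size=3):
    """Count how often each group of locations of size >= min_size
    is visited together within one time window (e.g., a week);
    frequently recurring groups are candidate hyperedges."""
    counts = Counter()
    for visited in trajectories:
        locs = sorted(set(visited))
        for r in range(min_size, len(locs) + 1):
            for group in combinations(locs, r):
                counts[group] += 1
    return counts
```

A pairwise flow network would record only the two-location edges inside each group; the hyperedge counts retain the fact that the whole group co-occurs.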


Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning

Chadimová, Milena, Jurášek, Eduard, Kliegr, Tomáš

arXiv.org Artificial Intelligence

This paper introduces a novel method, referred to as "hashing", which involves masking potentially bias-inducing words in large language models (LLMs) with hash-like meaningless identifiers to reduce cognitive biases and reliance on external knowledge. The method was tested across three sets of experiments involving a total of 490 prompts. Statistical analysis using chi-square tests showed significant improvements in all tested scenarios, which covered Llama, ChatGPT, Copilot, Gemini and Mixtral models. In the first experiment, hashing decreased the fallacy rate in a modified version of the "Linda" problem aimed at evaluating susceptibility to cognitive biases. In the second experiment, it improved LLM results on the frequent itemset extraction task. In the third experiment, we found hashing is also effective when the Linda problem is presented in a tabular format rather than text, indicating that the technique works across various input representations. Overall, the method was shown to improve bias reduction and incorporation of external knowledge. Despite bias reduction, hallucination rates were inconsistently reduced across types of LLM models. These findings suggest that masking bias-inducing terms can improve LLM performance, although its effectiveness is model- and task-dependent.
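The masking step itself can be as simple as replacing each flagged word or phrase with a stable, meaning-free token before the prompt is sent to the model. A minimal sketch of that substitution (the bias-word list is an assumption for illustration, not the paper's):

```python
import hashlib
import re

def hash_words(prompt, bias_words):
    """Replace each bias-inducing word or phrase with a short,
    meaningless hash-derived identifier, so the model cannot key
    on its semantics. Returns the masked prompt and the mapping
    needed to interpret the model's answer afterwards."""
    mapping = {w.lower(): "X" + hashlib.md5(w.lower().encode()).hexdigest()[:6]
               for w in bias_words}
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, bias_words)) + r")\b",
        re.IGNORECASE)
    masked = pattern.sub(lambda m: mapping[m.group(0).lower()], prompt)
    return masked, mapping
```

The mapping is kept so the hashed identifiers in the model's output can be translated back to the original terms.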


RPS: A Generic Reservoir Patterns Sampler

Diop, Lamine, Plantevit, Marc, Soulet, Arnaud

arXiv.org Artificial Intelligence

Efficient learning from streaming data is important for modern data analysis due to the continuous and rapid evolution of data streams. Despite significant advancements in stream pattern mining, challenges persist, particularly in managing complex data streams like sequential and weighted itemsets. While reservoir sampling serves as a fundamental method for randomly selecting fixed-size samples from data streams, its application to such complex patterns remains largely unexplored. In this study, we introduce an approach that harnesses a weighted reservoir to facilitate direct pattern sampling from streaming batch data, thus ensuring scalability and efficiency. We present a generic algorithm capable of addressing temporal biases and handling various pattern types, including sequential, weighted, and unweighted itemsets. Through comprehensive experiments conducted on real-world datasets, we evaluate the effectiveness of our method, showcasing its ability to construct accurate incremental online classifiers for sequential data. Our approach not only enables previously unusable online machine learning models for sequential data to achieve accuracy comparable to offline baselines but also represents significant progress in the development of incremental online sequential itemset classifiers.
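A standard building block for a weighted reservoir is the A-Res scheme of Efraimidis and Spirakis: draw a key u**(1/w) per item and keep the k largest keys. Whether RPS uses exactly this keying is not stated in the abstract, so treat this as a generic sketch:

```python
import heapq
import random

def weighted_reservoir(stream, k, rng=random):
    """A-Res weighted reservoir sampling over a stream of
    (item, weight) pairs: each item gets key = u**(1/w) for a
    uniform u in (0, 1), and the k items with the largest keys
    form a weighted sample without replacement."""
    heap = []  # min-heap of (key, item); root holds the smallest kept key
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Each element is touched once and the reservoir stays at a fixed size k, which is what makes the approach viable on unbounded streams; extending the keying to sequential or weighted itemset patterns is where the paper's contribution lies.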


Geo-FuB: A Method for Constructing an Operator-Function Knowledge Base for Geospatial Code Generation Tasks Using Large Language Models

Hou, Shuyang, Zhao, Anqi, Liang, Jianyuan, Shen, Zhangxiao, Wu, Huayi

arXiv.org Artificial Intelligence

The rise of spatiotemporal data and the need for efficient geospatial modeling have spurred interest in automating these tasks with large language models (LLMs). However, general LLMs often generate errors in geospatial code due to a lack of domain-specific knowledge on functions and operators. To address this, a retrieval-augmented generation (RAG) approach, utilizing an external knowledge base of geospatial functions and operators, is proposed. This study introduces a framework to construct such a knowledge base, leveraging geospatial script semantics. The framework includes: Function Semantic Framework Construction (Geo-FuSE), Frequent Operator Combination Statistics (Geo-FuST), and Semantic Mapping (Geo-FuM). Techniques like Chain-of-Thought, TF-IDF, and the APRIORI algorithm are utilized to derive and align geospatial functions. An example knowledge base, Geo-FuB, built from 154,075 Google Earth Engine scripts, is available on GitHub. Evaluation metrics show a high accuracy, reaching 88.89% overall, with structural and semantic accuracies of 92.03% and 86.79% respectively. Geo-FuB's potential to optimize geospatial code generation through the RAG and fine-tuning paradigms is highlighted.
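Of the techniques listed, TF-IDF is the most mechanical: it weights a function name by how often it appears in a script against how many scripts use it at all. A plain sketch over tokenized scripts (treating extracted function names as tokens is an assumption about the pipeline):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Plain TF-IDF over tokenized documents, e.g., lists of
    function names extracted from geospatial scripts. Returns one
    {token: score} dict per document."""
    n = len(docs)
    df = Counter()                      # document frequency per token
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                       for t in tf})
    return scores
```

Tokens that appear in every script score zero, so the surviving high-scoring names are the ones that characterize particular operator combinations.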


Explainability of Highly Associated Fuzzy Churn Patterns in Binary Classification

Wang, D. Y. C., Jordanger, Lars Arne, Lin, Jerry Chun-Wei

arXiv.org Artificial Intelligence

Customer churn, particularly in the telecommunications sector, influences both costs and profits. As the explainability of models becomes increasingly important, this study emphasizes not only the explainability of customer churn through machine learning models, but also the importance of identifying multivariate patterns and setting soft bounds for intuitive interpretation. The main objective is to use a machine learning model and fuzzy-set theory with top-k HUIM to identify highly associated patterns of customer churn with intuitive identification, referred to as Highly Associated Fuzzy Churn Patterns (HAFCP). Moreover, this method aids in uncovering association rules among multiple features across low, medium, and high distributions. Such discoveries are instrumental in enhancing the explainability of findings. Experiments show that when the top-5 HAFCPs are included in five datasets, a mixture of performance results is observed, with some showing notable improvements. It becomes clear that high-importance features enhance explanatory power through their distribution and patterns associated with other features. As a result, the study introduces an innovative approach that improves the explainability and effectiveness of customer churn prediction models.
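The "soft bounds" over low, medium, and high distributions can be pictured with ordinary fuzzy membership functions; a minimal triangular-partition sketch (the partition shape and range are illustrative assumptions, not the paper's):

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership: 0 at the feet a and c, 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzify(x, lo=0.0, hi=100.0):
    """Map a crisp feature value to low/medium/high memberships on
    [lo, hi] using three overlapping triangular sets."""
    mid = (lo + hi) / 2.0
    eps = 1e-9  # nudge the outer feet so the endpoints score 1.0
    return {
        "low": triangular(x, lo - eps, lo, mid),
        "medium": triangular(x, lo, mid, hi),
        "high": triangular(x, mid, hi, hi + eps),
    }
```

A value near a bin boundary belongs partly to both bins, which is what lets mined patterns read as "low tenure and high monthly charge" without brittle hard cutoffs.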


Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

Sarr, Djibril

arXiv.org Machine Learning

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach lies a rigorous demand for both explainability and interpretability, ensuring that the rationale behind the identification and correction of data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that integrates statistical methods with machine learning algorithms. Indeed, by leveraging statistical techniques alongside machine learning, we strike a balance between accuracy and explainability, enabling users to trust and comprehend the assessment process. Acknowledging the challenges associated with automating the data quality assessment process, particularly in terms of time efficiency and accuracy, we adopt a pragmatic strategy, employing resource-intensive algorithms only when necessary, while favoring simpler, more efficient solutions whenever possible. Through a practical analysis conducted on a publicly provided dataset, we illustrate the challenges that arise when trying to enhance data quality while keeping explainability. We demonstrate the effectiveness of our approach in detecting and rectifying missing values, duplicates and typographical errors as well as the challenges remaining to be addressed to achieve similar accuracy on statistical outliers and logic errors under the constraints set in our work.
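Two of the three defect types named above, absence and redundancy, reduce to simple scans over the records; a minimal sketch on dict records (the field names are hypothetical):

```python
def quality_report(rows):
    """Report two of the defect types discussed above on a list of
    dict records: absence (missing values per column) and redundancy
    (exact duplicate rows). Incoherence checks would need per-column
    type or rule models layered on top of this."""
    cols = list(rows[0])
    missing = {c: sum(1 for r in rows if r.get(c) in (None, ""))
               for c in cols}
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"missing": missing, "duplicates": duplicates}
```

This reflects the pragmatic strategy the abstract describes: cheap deterministic passes handle the easy defects, reserving heavier statistical or learned models for outliers and logic errors.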