AITopics

2308.05125

Country:

Europe > Netherlands > South Holland > Leiden (0.27)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
North America > United States > District of Columbia > Washington (0.04)
(3 more...)

Genre:

Research Report > New Finding (0.88)
Research Report > Promising Solution (0.64)
Overview > Innovation (0.64)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Data Science > Data Mining (0.97)

Barmomanesh, Sahar, Miranda-Soberanis, Victor

Toward Improving Predictive Risk Modelling for New Zealand's Child Welfare System Using Clustering Methods

arXiv.org Artificial Intelligence Aug-8-2023

The combination of clinical judgement and predictive risk models crucially assists social workers to segregate children at risk of maltreatment and decide when authorities should intervene. Predictive risk modelling to address this matter has been initiated by several governmental welfare authorities worldwide, involving administrative data and machine learning algorithms. While previous studies have investigated risk factors relating to child maltreatment, several gaps remain in understanding how such risk factors interact and whether predictive risk models perform differently for children with different features. By integrating Principal Component Analysis and K-Means clustering, this paper presents initial findings of our work on the identification of such features as well as their potential effect on current risk modelling frameworks. This approach allows us to examine existing but as yet unidentified clusters of New Zealand (NZ) children reported with care and protection concerns, to analyse their inner structure, and to evaluate the performance of prediction models trained cluster-wise. We aim to determine the degree of clustering required as an early step in the development of predictive risk models for child maltreatment, and so enhance the accuracy of such models intended for use by child protection authorities. The results from testing LASSO logistic regression models trained on identified clusters revealed no significant difference in their performance. The models, however, performed slightly better for two clusters including younger children. Our results suggest that separate models might need to be developed for children of a certain age to gain additional control over the error rates and to improve model accuracy. While results are promising, more evidence is needed to draw definitive conclusions, and further investigation is necessary.

artificial intelligence, machine learning, maltreatment, (16 more...)

2308.0406

Country:

Oceania > New Zealand > North Island > Auckland Region > Auckland (0.05)
Oceania > New Zealand > North Island > Wellington Region > Wellington (0.04)
North America > United States > Pennsylvania > Allegheny County (0.04)
North America > United States > Colorado (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Law > Family Law (1.00)
Government (1.00)
Health & Medicine > Therapeutic Area (0.94)
Education > Social Development & Welfare > Child Welfare (0.42)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

arXiv.org Artificial Intelligence Aug-7-2023

Wide Gaps and Clustering Axioms

Kłopotek, Mieczysław A.

The widely applied k-means algorithm produces clusterings that violate our expectations with respect to high/low similarity/density and conflicts with Kleinberg's axiomatic system for distance-based clustering algorithms, which formalizes those expectations in a natural way. In particular, k-means violates the consistency axiom. We hypothesise that this clash is due to the unstated expectation that the data themselves should be clusterable if the algorithm clustering them is to fit a clustering axiomatic system. To demonstrate this, we introduce two new clusterability properties, variational k-separability and residual k-separability, and show that Kleinberg's consistency axiom then holds for k-means operating in Euclidean or non-Euclidean space. Furthermore, we propose extensions of the k-means algorithm that approximately fit Kleinberg's richness axiom, which does not hold for k-means. In this way, we reconcile k-means with Kleinberg's axiomatic framework in Euclidean and non-Euclidean settings. Besides the contribution to the theory of axiomatic frameworks of clustering and to clusterability theory, a practical contribution is the possibility of constructing datasets for testing algorithms that optimize the k-means cost function. This includes a method for constructing clusterable data with a global optimum known in advance.
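
For orientation, the objects named in this abstract can be written out as follows. This is a sketch using the standard textbook formulations of the k-means cost and of Kleinberg's consistency and richness axioms; the symbols S, Γ, C_j, μ_j, f, d and d' are notation assumed here, not taken from the paper, and the paper's own separability definitions are not reproduced.

```latex
% k-means cost of a partition \Gamma = \{C_1, \dots, C_k\} of a data set S
Q(\Gamma) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x .

% d' is a \Gamma-transformation of the distance function d if
d'(x,y) \le d(x,y) \ \text{for } x, y \text{ in the same cluster of } \Gamma,
\qquad
d'(x,y) \ge d(x,y) \ \text{for } x, y \text{ in different clusters of } \Gamma .

% Consistency: if f(d) = \Gamma and d' is a \Gamma-transformation of d,
%              then f(d') = \Gamma.
% Richness:    for every partition \Gamma of S there is a distance
%              function d with f(d) = \Gamma.
```

With k fixed in advance, k-means can only return partitions into exactly k clusters, which is one way to see why richness fails for it.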

algorithm, artificial intelligence, machine learning, (18 more...)

2308.03464

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

An Unsupervised Machine Learning Approach for Ground-Motion Spectra Clustering and Selection

Bond, R. Bailey, Ren, Pu, Hajjar, Jerome F., Sun, Hao

Clustering analysis of sequence data continues to address many applications in engineering design, aided by the rapid growth of machine learning in applied science. This paper presents an unsupervised machine learning algorithm to extract defining characteristics of earthquake ground-motion spectra, also called latent features, to aid in ground-motion selection (GMS). In this context, a latent feature is a low-dimensional machine-discovered spectral characteristic learned through nonlinear relationships of a neural network autoencoder. Machine-discovered latent features can be combined with traditionally defined intensity measures, and clustering can be performed to select a representative subgroup from a large ground-motion suite. The objective of efficient GMS is to choose characteristic records representative of what the structure will probabilistically experience in its lifetime. Three examples are presented to validate this approach, including the use of synthetic and field recorded ground-motion datasets. The presented deep embedding clustering of ground-motion spectra has three main advantages: 1. defining characteristics that represent the sparse spectral content of ground-motions are discovered efficiently through training of the autoencoder, 2. domain knowledge is incorporated into the machine learning framework with conditional variables in the deep embedding scheme, and 3. the method exhibits excellent performance when compared to a benchmark seismic hazard analysis.

artificial intelligence, machine learning, spectra, (18 more...)

2212.03188

Country:

North America > United States > California (0.46)
Europe (0.28)
Asia > Middle East (0.14)
(2 more...)

Genre: Research Report (1.00)

Industry: Energy > Oil & Gas > Upstream (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Spatial-Temporal Data Mining for Ocean Science: Data, Methodologies, and Opportunities

Yang, Hanchen, Li, Wengen, Wang, Shuyu, Li, Hui, Guan, Jihong, Zhou, Shuigeng, Cao, Jiannong

With the rapid amassing of spatial-temporal (ST) ocean data, many spatial-temporal data mining (STDM) studies have been conducted to address various oceanic issues, including climate forecasting and disaster warning. Compared with typical ST data (e.g., traffic data), ST ocean data is more complicated and has unique characteristics, e.g., diverse regionality and high sparsity. These characteristics make it difficult to design and train STDM models on ST ocean data. To the best of our knowledge, a comprehensive survey of existing studies remains missing in the literature, which hinders not only computer scientists from identifying the research issues in ocean data mining but also ocean scientists from applying advanced STDM techniques. In this paper, we provide a comprehensive survey of existing STDM studies for ocean science. Concretely, we first review the widely-used ST ocean datasets and highlight their unique characteristics. Then, typical ST ocean data quality enhancement techniques are explored. Next, we classify existing STDM studies in ocean science into four types of tasks, i.e., prediction, event detection, pattern mining, and anomaly detection, and elaborate on the techniques for these tasks. Finally, promising research opportunities are discussed. This survey can help scientists from both computer science and ocean science better understand the fundamental concepts, key techniques, and open challenges of STDM for ocean science.

data mining, machine learning, prediction, (19 more...)

2307.10803

Country:

Atlantic Ocean > Gulf of Mexico (0.28)
Asia > China > Hong Kong (0.14)
Atlantic Ocean > Mediterranean Sea (0.14)
(9 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Energy > Renewable (1.00)
Government > Regional Government > North America Government > United States Government (0.93)
Transportation > Marine (0.92)
(2 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(3 more...)

Unsupervised Representation Learning for Time Series: A Review

Meng, Qianwen, Qian, Hangwei, Liu, Yong, Xu, Yonghui, Shen, Zhiqi, Cui, Lizhen

Unsupervised representation learning approaches aim to learn discriminative feature representations from unlabeled data, without the requirement of annotating every sample. Enabling unsupervised representation learning is extremely crucial for time series data, due to its unique annotation bottleneck caused by its complex characteristics and lack of visual cues compared with other data modalities. In recent years, unsupervised representation learning techniques have advanced rapidly in various domains. However, there is a lack of systematic analysis of unsupervised representation learning approaches for time series. To fill the gap, we conduct a comprehensive literature review of existing rapidly evolving unsupervised representation learning approaches for time series. Moreover, we also develop a unified and standardized library, named ULTS (i.e., Unsupervised Learning for Time Series), to facilitate fast implementations and unified evaluations on various models. With ULTS, we empirically evaluate state-of-the-art approaches, especially the rapidly evolving contrastive learning methods, on 9 diverse real-world datasets. We further discuss practical considerations as well as open research challenges on unsupervised representation learning for time series to facilitate future research in this field.

artificial intelligence, machine learning, representation, (17 more...)

2308.01578

Country:

Asia > Singapore (0.04)
Asia > China > Shandong Province > Jinan (0.04)
Asia > China > Beijing > Beijing (0.04)
(5 more...)

Genre:

Research Report > Promising Solution (1.00)
Overview (1.00)

Industry:

Education (1.00)
Information Technology > Security & Privacy (0.67)
Banking & Finance (0.67)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.69)

Jurečková, Olha, Jureček, Martin, Stamp, Mark, Di Troia, Fabio, Lórencz, Róbert

Classification and Online Clustering of Zero-Day Malware

A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples, but also classified into malware families. For this purpose, it is necessary to investigate how existing malware families are developed and to examine emerging families. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity ranging from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.
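
The two quantities this abstract relies on, a score-based split between "classify" and "cluster" and the purity of a clustering, can be illustrated with a minimal sketch. The function names, the use of the maximum class score, and the threshold value are assumptions made for illustration; the abstract does not state how the paper's score rule is defined. Purity is computed in its usual form: the fraction of samples that carry the majority label of their cluster.

```python
from collections import Counter


def route_by_score(scores, threshold):
    """Split sample indices into (classify, cluster) lists.

    Assumption: a sample is sent to the classifier when its highest
    class score reaches the (hypothetical) threshold, otherwise it is
    set aside for clustering.
    """
    classify, cluster = [], []
    for i, class_scores in enumerate(scores):
        target = classify if max(class_scores) >= threshold else cluster
        target.append(i)
    return classify, cluster


def purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    counts_by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(true_labels)


if __name__ == "__main__":
    # Toy data: two samples with per-class scores, threshold chosen arbitrarily.
    print(route_by_score([[0.9, 0.1], [0.5, 0.5]], threshold=0.8))
    # Toy clustering: cluster 0 is all "a", cluster 1 is mostly "b".
    print(purity([0, 0, 1, 1, 1], ["a", "a", "a", "b", "b"]))
```

Purity never decreases when clusters are split further, which is consistent with the higher value the abstract reports for ten clusters than for four.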

artificial intelligence, machine learning, malware, (18 more...)

2305.00605

Country:

Europe > Czechia > Prague (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.68)

arXiv.org Artificial Intelligence Aug-2-2023

A new approach for evaluating internal cluster validation indices

Botta-Dukát, Zoltán

A vast number of different methods are available for unsupervised classification. Since no algorithm and parameter setting performs best on all types of data, there is a need for cluster validation to select the algorithm that actually performs best. Several indices were proposed for this purpose without using any additional (external) information. These internal validation indices can be evaluated by applying them to classifications of datasets with a known cluster structure. Evaluation approaches differ in how they use the information on the ground-truth classification. This paper reviews these approaches, considering their advantages and disadvantages, and then suggests a new approach.

artificial intelligence, machine learning, partition, (16 more...)

2308.03894

Country:

Europe > Austria > Vienna (0.14)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
Europe > Netherlands > South Holland > Leiden (0.04)
(2 more...)

Genre:

Research Report (0.90)
Overview (0.75)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.54)

arXiv.org Artificial Intelligence Aug-2-2023

MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching

Zeng, Xiaocan, Wang, Pengfei, Mao, Yuren, Chen, Lu, Liu, Xiaoze, Gao, Yunjun

Entity Matching (EM), which aims to identify all entity pairs referring to the same real-world entity from relational tables, is one of the most important tasks in real-world data management systems. Because the labeling process of EM is extremely labor-intensive, unsupervised EM is more applicable than supervised EM in practical scenarios. Traditional unsupervised EM assumes that all entities come from two tables; however, it is more common to match entities from multiple tables in practical applications, that is, multi-table entity matching (multi-table EM). Unfortunately, effective and efficient unsupervised multi-table EM remains under-explored. To fill this gap, this paper formally studies the problem of unsupervised multi-table entity matching and proposes an effective and efficient solution, termed MultiEM. MultiEM is a parallelizable pipeline of enhanced entity representation, table-wise hierarchical merging, and density-based pruning. Extensive experimental results on six real-world benchmark datasets demonstrate the superiority of MultiEM in terms of effectiveness and efficiency.

data mining, machine learning, natural language, (19 more...)

2308.01927

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

arXiv.org Artificial Intelligence Aug-2-2023

Are Easy Data Easy (for K-Means)

Kłopotek, Mieczysław A.

This paper investigates the capability of correctly recovering well-separated clusters by various brands of the $k$-means algorithm. The concept of well-separatedness used here is derived directly from the common definition of clusters, which imposes an interplay between the requirements of within-cluster homogeneity and between-cluster diversity. Conditions are derived for a special case of well-separated clusters such that the global minimum of the $k$-means cost function coincides with the well-separatedness. An experimental investigation is performed to find out whether or not various brands of $k$-means are actually capable of discovering well-separated clusters. It turns out that they are not. A new algorithm is proposed that is a variation of $k$-means++ via repeated subsampling when choosing a seed. The new algorithm outperforms four other algorithms from the $k$-means family on the task.
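
As background for the seeding step this abstract refers to, here is a minimal sketch of standard $k$-means++ seeding (sampling each new seed with probability proportional to its squared distance from the nearest seed chosen so far) together with the $k$-means cost. It is not the paper's new algorithm: the abstract does not describe how the repeated subsampling is carried out, so that variation is deliberately left out. The function names and the toy points are illustrative assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng):
    """Pick k seeds by standard k-means++ (D^2) sampling."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    seeds = [points[rng.randrange(len(points))]]
    # d2[i] holds the squared distance from points[i] to its nearest seed.
    d2 = [sq_dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a seed; fall back to uniform choice.
            idx = rng.randrange(len(points))
        else:
            r = rng.random() * total
            acc, idx = 0.0, 0
            for i, w in enumerate(d2):
                if w <= 0:
                    continue  # points sitting on a seed are never re-picked
                idx = i
                acc += w
                if r < acc:
                    break
        seeds.append(points[idx])
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]
    return seeds


def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    seeds = kmeanspp_seeds(pts, 2, random.Random(0))
    print(seeds, kmeans_cost(pts, seeds))
```

Even on well-separated toy data like this, D^2 sampling can place both seeds in the same group with small probability, which is the kind of failure the abstract's experiments are concerned with.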

algorithm, artificial intelligence, machine learning, (9 more...)

2308.01926

Country:

Europe > Italy (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.53)