AITopics

2407.19414

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)
(4 more...)

Genre:

Overview (0.93)
Research Report > New Finding (0.67)

Industry:

Information Technology (0.93)
Health & Medicine (0.93)
Telecommunications (0.67)

Technology:

Information Technology > Data Science > Data Integration (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(2 more...)

arXiv.org Artificial Intelligence, Jul-28-2024

Short-Term Forecasting of Photovoltaic Power Generation Based on Entropy during the Foggy Winter

Yang, Xuan, Dong, Yunxuan, Wu, Thomas

Solar energy is one of the most promising renewable energy resources. Forecasting photovoltaic power generation is an important way to increase photovoltaic penetration. However, photovoltaic forecasting is complicated by uncertainty, especially in specific regions during the foggy winter. This paper proposes a novel model to address this problem. A developed entropy is created to quantify the uncertainty during the foggy winter. A clustering method and a modified retention network are applied to reduce complexity and to forecast, respectively. We adopt an optimization procedure to tune the hyperparameters. The multivariate forecasting model is validated on a dataset from a photovoltaic power station in Jiangsu Province, China. Experiments show that the proposed model improves forecasting accuracy compared to various models during the foggy winter.

accuracy, denote, forecasting, (16 more...)

2407.19663

Country:

Asia > China > Jiangsu Province (0.34)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.05)
Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report (1.00)

Industry: Energy > Renewable > Solar (1.00)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Yang, Yifan, Payani, Ali, Naghizadeh, Parinaz

Enhancing Group Fairness in Federated Learning through Personalization

arXiv.org Artificial Intelligence, Jul-27-2024

Personalized Federated Learning (FL) algorithms collaboratively train customized models for each client, enhancing the accuracy of the learned models on the client's local data (e.g., by clustering similar clients, or by fine-tuning models locally). In this paper, we investigate the impact of such personalization techniques on the group fairness of the learned models, and show that personalization can also lead to improved (local) fairness as an unintended benefit. We begin by illustrating these benefits of personalization through numerical experiments comparing two classes of personalized FL algorithms (clustering and fine-tuning) against a baseline FedAvg algorithm, elaborating on the reasons behind improved fairness using personalized FL, and then providing analytical support. Motivated by these, we further propose a new, Fairness-aware Federated Clustering Algorithm, Fair-FCA, in which clients can be clustered to obtain a (tuneable) fairness-accuracy tradeoff. Through numerical experiments, we demonstrate the ability of Fair-FCA to strike a balance between accuracy and fairness at the client level.

algorithm, fairness, fairness performance, (14 more...)

2407.19331

Country:

North America > United States > Virginia (0.04)
North America > United States > Ohio (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Motegi, Ryosuke, Seki, Yoichi

A simulation study of cluster search algorithms in data set generated by Gaussian mixture models

arXiv.org Machine Learning, Jul-27-2024

Determining the number of clusters is a fundamental issue in data clustering. Several algorithms have been proposed, including centroid-based algorithms using the Euclidean distance and model-based algorithms using a mixture of probability distributions. Among these, greedy algorithms that search for the number of clusters by repeatedly splitting or merging clusters have advantages in terms of computation time for problems with large sample sizes. However, systematic experimental comparisons of these methods are still lacking. This study examines centroid- and model-based cluster search algorithms in various cases that Gaussian mixture models (GMMs) can generate. The cases are generated by combining five factors: dimensionality, sample size, the number of clusters, cluster overlap, and covariance type. The results show that some cluster-splitting criteria based on Euclidean distance make unreasonable decisions when clusters overlap. The results also show that model-based algorithms are insensitive to covariance type and cluster overlap compared to the centroid-based method if the sample size is sufficient. Our cluster search implementation is available at https://github.com/lipryou/searchClustK

algorithm, dip-means, interaction effect, (15 more...)

2407.19199

Country:

Asia > Japan (0.04)
North America > United States > Minnesota (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Ghani, Daniyar, Heard, Nicholas A., Passino, Francesco Sanna

Approximate learning of parsimonious Bayesian context trees

arXiv.org Machine Learning, Jul-27-2024

Models for categorical sequences typically assume exchangeable or first-order dependent sequence elements. These are common assumptions, for example, in models of computer malware traces and protein sequences. Although such simplifying assumptions lead to computational tractability, these models fail to capture long-range, complex dependence structures that may be harnessed for greater predictive power. To this end, a Bayesian modelling framework is proposed to parsimoniously capture rich dependence structures in categorical sequences, with memory efficiency suitable for real-time processing of data streams. Parsimonious Bayesian context trees are introduced as a form of variable-order Markov model with conjugate prior distributions. The novel framework requires fewer parameters than fixed-order Markov models by dropping redundant dependencies and clustering sequential contexts. Approximate inference on the context tree structure is performed via a computationally efficient model-based agglomerative clustering procedure. The proposed framework is tested on synthetic and real-world data examples, and it outperforms existing sequence models when fitted to real protein sequences and honeypot computer terminal sessions.

context tree, markov model, sequence, (15 more...)

2407.19236

Country:

Europe > United Kingdom (0.14)
North America > United States > New York (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
(2 more...)

Barron, Ryan, Eren, Maksim E., Bhattarai, Manish, Boureima, Ismael, Matuszek, Cynthia, Alexandrov, Boian S.

Binary Bleed: Fast Distributed and Parallel Method for Automatic Model Selection

arXiv.org Artificial Intelligence, Jul-26-2024

In several Machine Learning (ML) clustering and dimensionality reduction approaches, such as non-negative matrix factorization (NMF), RESCAL, and K-Means clustering, users must select a hyper-parameter k to define the number of clusters or components that yield an ideal separation of samples or clean clusters. This selection, while difficult, is crucial to avoid overfitting or underfitting the data. Several ML applications use scoring methods (e.g., Silhouette and Davies-Bouldin scores) to evaluate the cluster pattern stability for a specific k. The score is calculated for different trials over a range of k, and the ideal k is heuristically selected as the value before the model starts overfitting, indicated by a drop or increase in the score resembling an elbow curve plot. While the grid-search method can be used to accurately find a good k value, visiting a range of k can become time-consuming and computationally resource-intensive. In this paper, we introduce the Binary Bleed method based on binary search, which significantly reduces the k search space for these grid-search ML algorithms by truncating the target k values from the search space using a heuristic with thresholding over the scores. Binary Bleed is designed to work with single-node serial, single-node multi-processing, and distributed computing resources. In our experiments, we demonstrate the reduced search space gain over a naive sequential search of the ideal k and the accuracy of Binary Bleed in identifying the correct k for NMFk, K-Means, pyDNMFk, and pyDRESCALk with Silhouette and Davies-Bouldin scores. We make our implementation of Binary Bleed for the NMF algorithm available on GitHub.
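The abstract describes pruning the k search space with a binary-search-style heuristic and a score threshold. The sketch below is not the authors' Binary Bleed implementation; it is a plain binary search under the added assumption that the score stays at or above the threshold up to the ideal k and falls below it afterwards. The function name `largest_k_above_threshold` and the `score` callable are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


def largest_k_above_threshold(
    score: Callable[[int], float],
    k_min: int,
    k_max: int,
    threshold: float,
) -> Tuple[Optional[int], Dict[int, float]]:
    """Binary search for the largest k whose score is >= threshold.

    Assumption (not from the paper): score(k) >= threshold for every k up to
    the ideal k, and score(k) < threshold for every larger k.
    Returns the selected k (or None) and the scores that were evaluated.
    """
    evaluated: Dict[int, float] = {}
    best: Optional[int] = None
    lo, hi = k_min, k_max
    while lo <= hi:
        mid = (lo + hi) // 2
        evaluated[mid] = score(mid)  # one model fit per visited k
        if evaluated[mid] >= threshold:
            best = mid       # mid is acceptable; discard all smaller k
            lo = mid + 1
        else:
            hi = mid - 1     # mid is past the ideal k; discard all larger k
    return best, evaluated


if __name__ == "__main__":
    # Toy score: stable clusters up to k = 7, unstable afterwards.
    toy_score = lambda k: 0.9 if k <= 7 else 0.2
    best_k, visited = largest_k_above_threshold(toy_score, 1, 30, threshold=0.5)
    print(best_k, sorted(visited))
```

Under the stated assumption, the number of model fits grows logarithmically with the size of the k range rather than linearly, which is the kind of saving the abstract refers to.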

binary bleed, optimal, threshold, (16 more...)

2407.19125

Country:

North America > United States > New Mexico > Los Alamos County > Los Alamos (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Maryland > Baltimore County (0.04)
North America > United States > Maryland > Baltimore (0.04)

Genre: Research Report > New Finding (0.34)

Industry:

Information Technology > Security & Privacy (0.46)
Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

arXiv.org Artificial Intelligence, Jul-26-2024

Embedding And Clustering Your Data Can Improve Contrastive Pretraining

Merrick, Luke

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
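The stratification idea in the abstract, where each minibatch is drawn from a single semantic cluster within a single source, can be illustrated with a minimal sketch. It assumes cluster ids have already been computed (for example by k-means over embeddings, which is not shown); the function name `stratified_minibatches` and the `(source, cluster_id, item)` tuple layout are hypothetical, and this is not the author's training code.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_minibatches(
    examples: Sequence[Tuple[Hashable, int, object]],
    batch_size: int,
    seed: int = 0,
) -> List[List[object]]:
    """Build minibatches whose items all share one (source, cluster) group.

    Each example is (source, cluster_id, item); cluster_id is assumed to be
    precomputed. Incomplete trailing batches are dropped.
    """
    rng = random.Random(seed)
    groups: Dict[Tuple[Hashable, int], List[object]] = defaultdict(list)
    for source, cluster_id, item in examples:
        groups[(source, cluster_id)].append(item)

    batches: List[List[object]] = []
    for items in groups.values():
        rng.shuffle(items)  # randomize order within the group
        for start in range(0, len(items) - batch_size + 1, batch_size):
            batches.append(items[start:start + batch_size])
    rng.shuffle(batches)  # interleave groups across training steps
    return batches


if __name__ == "__main__":
    data = [(s, c, (s, c, i)) for s in ("web", "forum") for c in (0, 1) for i in range(5)]
    for batch in stratified_minibatches(data, batch_size=2)[:3]:
        print(batch)
```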

contrastive pretraining, dataset, embedding and clustering, (13 more...)

2407.18887

Country:

North America > United States > Montana > Flathead County > Kalispell (0.14)
North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
North America > Canada (0.04)
(18 more...)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (1.00)
Law (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
(6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Wong, Albert, Cheng, Florence Wing Yau, Keung, Ashley, Hercules, Yamileth, Garcia, Mary Alexandra, Lim, Yew-Wei, Pham, Lien

An Iterative Approach to Topic Modelling

arXiv.org Artificial Intelligence, Jul-25-2024

Topic modelling has become increasingly popular for summarizing text data, such as social media posts and articles. However, topic modelling is usually completed in one shot. Assessing the quality of resulting topics is challenging. No effective methods or measures have been developed for assessing the results or for making further enhancements to the topics. In this research, we propose to use an iterative process to perform topic modelling that gives rise to a sense of completeness of the resulting topics when the process is complete. Using the BERTopic package, a popular method in topic modelling, we demonstrate how the modelling process can be applied iteratively to arrive at a set of topics that could not be further improved upon using one of the three selected measures for clustering comparison as the decision criteria. This demonstration is conducted using a subset of the COVIDSenti-A dataset. The early success leads us to believe that further research using this approach in conjunction with other topic modelling algorithms could be viable.

algorithm, iteration, iterative approach, (13 more...)

2407.17892

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.16)
Europe > Switzerland (0.04)
Asia > Singapore (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

van Vuren, Thomas, Cronk, Thomas, Sanders, Jaron

Estimating the number of clusters of a Block Markov Chain

arXiv.org Machine Learning, Jul-25-2024

Clustering algorithms frequently require the number of clusters to be chosen in advance, but it is usually not clear how to do this. To tackle this challenge when clustering within sequential data, we present a method for estimating the number of clusters when the data is a trajectory of a Block Markov Chain. Block Markov Chains are Markov Chains that exhibit a block structure in their transition matrix. The method considers a matrix that counts the number of transitions between different states within the trajectory, and transforms this into a spectral embedding whose dimension is set via singular value thresholding. The number of clusters is subsequently estimated via density-based clustering of this spectral embedding, an approach inspired by literature on the Stochastic Block Model. By leveraging and augmenting recent results on the spectral concentration of random matrices with Markovian dependence, we show that the method is asymptotically consistent, in spite of the dependencies between the count matrix's entries, and even when the count matrix is sparse. We also present a numerical evaluation of our method, and compare it to alternatives.
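The first step named in the abstract, a matrix counting the transitions between states along the trajectory, can be sketched as follows. This covers only the count matrix; the spectral embedding, singular value thresholding, and density-based clustering are not shown. The function name `transition_counts` and the encoding of states as integers `0..n_states-1` are assumptions made for illustration.

```python
from typing import List, Sequence


def transition_counts(trajectory: Sequence[int], n_states: int) -> List[List[int]]:
    """Count observed transitions along a trajectory.

    counts[i][j] is the number of times state i is immediately followed by
    state j. States are assumed to be integers in range(n_states).
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for current, nxt in zip(trajectory, trajectory[1:]):
        counts[current][nxt] += 1
    return counts


if __name__ == "__main__":
    for row in transition_counts([0, 1, 1, 2, 0, 1], n_states=3):
        print(row)
```

A trajectory of length T yields T - 1 counted transitions, so for a short trajectory over many states most entries stay zero, which is the sparse regime the abstract mentions.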

algorithm, algorithm 1, algorithm 2, (17 more...)

2407.18287

Country:

Europe > Netherlands > North Brabant > Eindhoven (0.04)
Europe > Spain > Andalusia > Cádiz Province > Cadiz (0.04)

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Efe, Onur, Ozakin, Arkadas

The seismic purifier: An unsupervised approach to seismic signal detection via representation learning

arXiv.org Machine Learning, Jul-25-2024

In this paper, we develop an unsupervised learning approach to earthquake detection. We train a specific class of deep auto-encoders that learn to reproduce the input waveforms after a data-compressive bottleneck, and then use a simple triggering algorithm at the bottleneck to label waveforms as noise or signal. Our approach is motivated by the intuition that efficient compression of data should represent signals differently from noise, and is facilitated by a time-axis-preserving approach to auto-encoding and intuitively-motivated choices on the architecture and triggering. We demonstrate that the detection performance of the unsupervised approach is comparable to, and in some cases better than, some of the state-of-the-art supervised methods. Moreover, it has strong cross-dataset generalization. By experimenting with various modifications, we demonstrate that the detection performance is insensitive to various technical choices made in the algorithm. Our approach has the potential to be useful for other signal detection problems with time series data.

autoencoder, representation, waveform, (15 more...)

2407.18402

Country:

Oceania > New Zealand (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Europe > Italy (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)

Genre: Research Report (1.00)

Industry: Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)