AITopics

2507.01696

Country:

North America > United States (0.14)
North America > Canada > Ontario > Toronto (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.05)
(4 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Peña-Asensio, Eloy, Ferrari, Fabio

Meteoroid stream identification with HDBSCAN unsupervised clustering algorithm

arXiv.org Artificial Intelligence Jul-3-2025

Accurate identification of meteoroid streams is central to understanding their origins and evolution. However, overlapping clusters and background noise hinder classification, an issue amplified for missions such as ESA's LUMIO that rely on meteor shower observations to infer lunar meteoroid impact parameters. This study evaluates the performance of the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm for unsupervised meteoroid stream identification, comparing its outcomes with the established Cameras for All-Sky Meteor Surveillance (CAMS) look-up table method. We analyze the CAMS Meteoroid Orbit Database v3.0 using three feature vectors: LUTAB (CAMS geocentric parameters), ORBIT (heliocentric orbital elements), and GEO (adapted geocentric parameters). HDBSCAN is applied with varying minimum cluster sizes and two cluster selection methods (eom and leaf). To align HDBSCAN clusters with CAMS classifications, the Hungarian algorithm determines the optimal mapping. Clustering performance is assessed via the Silhouette score, Normalized Mutual Information, and F1 score, with Principal Component Analysis further supporting the analysis. With the GEO vector, HDBSCAN confirms 39 meteoroid streams, 21 strongly aligning with CAMS. The ORBIT vector identifies 30 streams, 13 with high matching scores. Less active showers pose identification challenges. The eom method consistently yields superior performance and agreement with CAMS. Although HDBSCAN requires careful selection of the minimum cluster size, it delivers robust, internally consistent clusters and outperforms the look-up table method in statistical coherence. These results underscore HDBSCAN's potential as a mathematically consistent alternative for meteoroid stream identification, although further validation is needed to assess physical validity.
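The cluster-to-catalogue alignment step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the label arrays are toy values, the function names are hypothetical, and the optimal one-to-one mapping is found by brute force over permutations, which returns the same optimum the Hungarian algorithm would reach in polynomial time but is only practical for a handful of clusters.

```python
from itertools import permutations


def best_mapping(pred, ref, noise=-1):
    """Return the one-to-one cluster -> class mapping with maximal total overlap.

    Brute-force stand-in for the Hungarian algorithm (assumption: few clusters).
    Points labeled `noise` on either side are ignored.
    """
    pred_ids = sorted(set(pred) - {noise})
    ref_ids = sorted(set(ref) - {noise})

    # Contingency counts: members shared by each (cluster, class) pair.
    counts = {(p, r): 0 for p in pred_ids for r in ref_ids}
    for p, r in zip(pred, ref):
        if (p, r) in counts:
            counts[(p, r)] += 1

    # Pad the shorter side with None so the assignment problem is square.
    size = max(len(pred_ids), len(ref_ids))
    rows = pred_ids + [None] * (size - len(pred_ids))
    cols = ref_ids + [None] * (size - len(ref_ids))

    best_score, best_perm = -1, None
    for perm in permutations(cols):
        score = sum(counts.get((p, r), 0) for p, r in zip(rows, perm))
        if score > best_score:
            best_score, best_perm = score, perm

    mapping = {p: r for p, r in zip(rows, best_perm)
               if p is not None and r is not None}
    return mapping, best_score


# Toy labels (made up): -1 marks noise, other integers are cluster or class ids.
pred_labels = [0, 0, 0, 1, 1, 2, 2, 2, -1, -1]
ref_labels = [10, 10, 20, 20, 20, 30, 30, 30, 10, -1]
mapping, score = best_mapping(pred_labels, ref_labels)
print(mapping, score)
```

Once clusters are renamed through such a mapping, per-class scores such as F1 can be computed against the reference labels.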

artificial intelligence, data mining, machine learning, (18 more...)

2507.01501

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Italy > Lombardy > Milan (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Government (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Nikolikj, Ana, Ochoa, Gabriela, Eftimov, Tome

Customized Exploration of Landscape Features Driving Multi-Objective Combinatorial Optimization Performance

arXiv.org Artificial Intelligence Jul-3-2025

We present an analysis of landscape features for predicting the performance of multi-objective combinatorial optimization algorithms. We consider features from the recently proposed compressed Pareto Local Optimal Solutions Networks (C-PLOS-net) model of combinatorial landscapes. The benchmark instances are a set of rmnk-landscapes with 2 and 3 objectives and various levels of ruggedness and objective correlation. We consider the performance of three algorithms -- Pareto Local Search (PLS), Global Simple EMO Optimizer (GSEMO), and Non-dominated Sorting Genetic Algorithm (NSGA-II) -- using the resolution and hypervolume metrics. Our tailored analysis reveals feature combinations that influence algorithm performance specific to certain landscapes. This study provides deeper insights into feature importance, tailored to specific rmnk-landscapes and algorithms.

artificial intelligence, evolutionary algorithm, machine learning, (19 more...)

2507.01638

Country:

Europe > Slovenia (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > Scotland > Stirling > Stirling (0.04)
(4 more...)

Genre: Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.90)
(2 more...)

Szwagier, Tom, Mattei, Pierre-Alexandre, Bouveyron, Charles, Pennec, Xavier

Parsimonious Gaussian mixture models with piecewise-constant eigenvalue profiles

arXiv.org Machine Learning Jul-3-2025

Gaussian mixture models (GMMs) are ubiquitous in statistical learning, particularly for unsupervised problems. While full GMMs suffer from the overparameterization of their covariance matrices in high-dimensional spaces, spherical GMMs (with isotropic covariance matrices) lack the flexibility to fit certain anisotropic distributions. Connecting these two extremes, we introduce a new family of parsimonious GMMs with piecewise-constant covariance eigenvalue profiles. These extend several low-rank models, like the celebrated mixtures of probabilistic principal component analyzers (MPPCA), by enabling any possible sequence of eigenvalue multiplicities. If the latter are prespecified, then we can naturally derive an expectation-maximization (EM) algorithm to learn the mixture parameters. Otherwise, to address the notoriously challenging issue of jointly learning the mixture parameters and hyperparameters, we propose a componentwise penalized EM algorithm, whose monotonicity is proven. We show the superior likelihood-parsimony tradeoffs achieved by our models on a variety of unsupervised experiments: density fitting, clustering and single-image denoising.
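To make the idea of a piecewise-constant eigenvalue profile concrete, the sketch below takes a sorted list of covariance eigenvalues and a prespecified sequence of multiplicities, then replaces each block by its mean. Block averaging is used here only as an illustration of the constraint; whether the paper's EM update takes exactly this form is an assumption. The second helper counts covariance parameters as one eigenvalue per block plus the dimension of the corresponding flag of eigenspaces, which recovers the full (all multiplicities 1) and spherical (a single block) extremes. Function names and numbers are hypothetical.

```python
def piecewise_constant_profile(eigenvalues, multiplicities):
    """Replace each block of sorted eigenvalues by the block mean."""
    if sum(multiplicities) != len(eigenvalues):
        raise ValueError("multiplicities must sum to the number of eigenvalues")
    ordered = sorted(eigenvalues, reverse=True)
    profile, start = [], 0
    for m in multiplicities:
        block = ordered[start:start + m]
        profile.extend([sum(block) / m] * m)
        start += m
    return profile


def covariance_parameter_count(multiplicities):
    """One eigenvalue per block plus the dimension of the flag of eigenspaces."""
    p = sum(multiplicities)
    flag_dim = (p * (p - 1) - sum(m * (m - 1) for m in multiplicities)) // 2
    return len(multiplicities) + flag_dim


# Toy spectrum (made up) in dimension 5, constrained to multiplicities (2, 3).
profile = piecewise_constant_profile([6.0, 4.0, 1.5, 1.0, 0.5], (2, 3))
print(profile)
print(covariance_parameter_count((1, 1, 1, 1, 1)))  # full covariance
print(covariance_parameter_count((2, 3)))           # intermediate model
print(covariance_parameter_count((5,)))             # spherical covariance
```

The parameter counts show how the multiplicities interpolate between the two extremes named in the abstract.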

artificial intelligence, bayesian inference, machine learning, (19 more...)

arXiv.org Machine Learning

2507.01542

Country:

Europe > France (0.14)
North America > Canada > Ontario > Toronto (0.14)
North America > United States > Wisconsin (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)

Franssen, Christian, van Lelyveld, Iman, Heidergott, Bernd

A Practical Guide to Interpretable Role-Based Clustering in Multi-Layer Financial Networks

Understanding the functional roles of financial institutions within interconnected markets is critical for effective supervision, systemic risk assessment, and resolution planning. We propose an interpretable role-based clustering approach for multi-layer financial networks, designed to identify the functional positions of institutions across different market segments. Our method follows a general clustering framework defined by proximity measures, cluster evaluation criteria, and algorithm selection. We construct explainable node embeddings based on egonet features that capture both direct and indirect trading relationships within and across market layers. Using transaction-level data from the ECB's Money Market Statistical Reporting (MMSR), we demonstrate how the approach uncovers heterogeneous institutional roles such as market intermediaries, cross-segment connectors, and peripheral lenders or borrowers. The results highlight the flexibility and practical value of role-based clustering in analyzing financial networks and understanding institutional behavior in complex market structures.

data mining, machine learning, node, (19 more...)

2507.006

Country:

North America > United States (0.14)
Europe > Netherlands > South Holland > Leiden (0.04)
South America > Brazil (0.04)
(3 more...)

Genre:

Research Report (0.50)
Workflow (0.48)

Industry:

Banking & Finance > Economy (0.69)
Banking & Finance > Trading (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Chakraborty, Mohna, Kulkarni, Adithya, Li, Qi

Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios

Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT's superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.

data mining, machine learning, natural language, (20 more...)

2507.0033

Country: North America > United States (0.68)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Westlin, Christiana, Singh, Ashutosh, Erdogmus, Deniz, Stratis, Georgios, Barrett, Lisa Feldman

Exploring Theory-Laden Observations in the Brain Basis of Emotional Experience

In the science of emotion, it is widely assumed that folk emotion categories form a biological and psychological typology, and studies are routinely designed and analyzed to identify emotion-specific patterns. This approach shapes the observations that studies report, ultimately reinforcing the assumption that guided the investigation. Here, we reanalyzed data from one such typologically-guided study that reported mappings between individual brain patterns and group-averaged ratings of 34 emotion categories. Our reanalysis was guided by an alternative view of emotion categories as populations of variable, situated instances, which predicts a priori that there will be significant variation in brain patterns within a category across instances. Correspondingly, our analysis made minimal assumptions about the structure of the variance present in the data. As predicted, we did not observe the original mappings and instead observed significant variation across individuals. These findings demonstrate how starting assumptions can ultimately impact scientific conclusions and suggest that a hypothesis must be supported using multiple analytic methods before it is taken seriously.

artificial intelligence, machine learning, participant, (15 more...)

2507.0032

Country: North America > United States > Massachusetts (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.88)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (0.94)
Health & Medicine > Diagnostic Medicine > Imaging (0.93)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Data Science (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Kappagantula, Vignesh Ram Nithin, Hassantabar, Shayan

Room Scene Discovery and Grouping in Unstructured Vacation Rental Image Collections

The rapid growth of vacation rental (VR) platforms has led to an increasing volume of property images, often uploaded without structured categorization. This lack of organization poses significant challenges for travelers attempting to understand the spatial layout of a property, particularly when multiple rooms of the same type are present. To address this issue, we introduce an effective approach for solving the room scene discovery and grouping problem, as well as identifying bed types within each bedroom group. This grouping helps travelers comprehend the spatial organization, layout, and sleeping configuration of the property. We propose a computationally efficient machine learning pipeline characterized by low latency and the ability to perform effectively with sample-efficient learning, making it well-suited for real-time and data-scarce environments. The pipeline integrates a supervised room-type detection model, a supervised overlap detection model to identify the overlap similarity between two images, and a clustering algorithm to group the images of the same space together using the similarity scores. Additionally, the pipeline maps each bedroom group to the corresponding bed types specified in the property's metadata, based on the visual content present in the group's images, using a Multi-modal Large Language Model (MLLM). We evaluate the aforementioned models individually and also assess the pipeline in its entirety, observing strong performance that significantly outperforms established approaches such as contrastive learning and clustering with pretrained embeddings.
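The grouping stage, which turns pairwise overlap scores into sets of images of the same space, can be illustrated with a small union-find sketch. The abstract does not name the clustering algorithm, so linking every pair whose score passes a threshold (connected components) is a stand-in chosen for brevity; the scores, the threshold of 0.5, and the function name are all hypothetical.

```python
def group_by_similarity(n_items, pair_scores, threshold=0.5):
    """Group item indices by linking every pair scoring at or above threshold."""
    parent = list(range(n_items))

    def find(i):
        # Walk to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in pair_scores.items():
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy overlap scores (made up) between five images of one room type.
scores = {(0, 1): 0.9, (1, 2): 0.7, (0, 2): 0.2, (3, 4): 0.8, (2, 3): 0.1}
groups = group_by_similarity(5, scores)
print(groups)
```

Note that images 0 and 2 end up together through image 1 even though their direct score is low, which is the usual chaining behavior of this kind of linkage.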

artificial intelligence, machine learning, natural language, (17 more...)

2507.00263

Genre: Research Report (0.50)

Industry: Banking & Finance > Real Estate (0.71)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

arXiv.org Artificial Intelligence Jul-1-2025

Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis

Zang, Zelin, Li, WenZhe, Chen, Fei, Xu, Yongjie, Yu, Chang, Lei, Zhen, Li, Stan Z.

In single-cell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding complex biological processes. Key to this is the modeling and generation of hierarchical data that represents the intrinsic structure within datasets. Traditional methods face limitations in terms of computational cost, performance, generative capacity, and stability. Recent VAE-based approaches have made strides in addressing these challenges but still require specialized network modules for each tree branch, limiting their stability and ability to capture deep hierarchical relationships. To overcome these challenges, we introduce a diffusion-based approach called HDTree. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and quantized diffusion processes to model tree node transitions. This method improves stability by eliminating branch-specific modules and enhances generative capacity through gradual hierarchical changes simulated by the diffusion process. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in terms of accuracy and performance. These contributions provide a new tool for hierarchical lineage analysis, enabling more accurate and efficient modeling of cellular differentiation paths and offering insights for downstream biological tasks. The code of HDTree is available at the anonymous link https://anonymous.4open.science/r/code_HDTree_review-A8DB.

artificial intelligence, deep learning, machine learning, (19 more...)

2506.23287

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > China > Beijing > Beijing (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.68)
Health & Medicine > Therapeutic Area > Hematology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.87)

Mahjourian, Nazanin, Nguyen, Vinh

Sanitizing Manufacturing Dataset Labels Using Vision-Language Models

arXiv.org Artificial Intelligence Jul-1-2025

The success of machine learning models in industrial applications is heavily dependent on the quality of the datasets used to train the models. However, large-scale datasets, especially those constructed from crowd-sourcing and web-scraping, often suffer from label noise, inconsistencies, and errors. This problem is particularly pronounced in manufacturing domains, where obtaining high-quality labels is costly and time-consuming. This paper introduces Vision-Language Sanitization and Refinement (VLSR), which is a vision-language-based framework for label sanitization and refinement in multi-label manufacturing image datasets. This method embeds both images and their associated textual labels into a shared semantic space leveraging the CLIP vision-language model. Then two key tasks are addressed in this process by computing the cosine similarity between embeddings. First, label sanitization is performed to identify irrelevant, misspelled, or semantically weak labels, and surface the most semantically aligned label for each image by comparing image-label pairs using cosine similarity between image and label embeddings. Second, the method applies density-based clustering on text embeddings, followed by iterative cluster merging, to group semantically similar labels into unified label groups. The Factorynet dataset, which includes noisy labels from both human annotations and web-scraped sources, is employed to evaluate the effectiveness of the proposed framework. Experimental results demonstrate that the VLSR framework successfully identifies problematic labels and improves label consistency. This method enables a significant reduction in label vocabulary through clustering, which ultimately enhances the dataset's quality for training robust machine learning models in industrial applications with minimal human intervention.
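The label sanitization step, which scores each image-label pair by cosine similarity in a shared embedding space, might look roughly like the sketch below. It is not the VLSR implementation: the three-dimensional vectors are hand-written stand-ins for CLIP embeddings, the 0.5 cutoff is an arbitrary assumption, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sanitize_labels(image_vec, label_vecs, threshold=0.5):
    """Return the best-aligned label, the kept labels, and the flagged labels."""
    scores = {label: cosine(image_vec, vec) for label, vec in label_vecs.items()}
    best = max(scores, key=scores.get)
    kept = [label for label, s in scores.items() if s >= threshold]
    flagged = [label for label, s in scores.items() if s < threshold]
    return best, kept, flagged


# Toy embeddings (made up, not CLIP output) for one image and its three labels.
image = [0.9, 0.1, 0.0]
labels = {
    "gear": [1.0, 0.0, 0.0],
    "tool": [0.6, 0.8, 0.0],
    "sunset": [0.0, 0.0, 1.0],
}
best, kept, flagged = sanitize_labels(image, labels)
print(best, kept, flagged)
```

In a real setting the vectors would come from an image encoder and a text encoder that share one embedding space, and the cutoff would need tuning on held-out data.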

artificial intelligence, machine learning, natural language, (19 more...)

2506.23465

Country:

North America > United States > Michigan (0.04)
Asia > China (0.04)
North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Leisure & Entertainment > Sports (0.46)
Health & Medicine (0.46)
Automobiles & Trucks (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)