AITopics

doi: 10.1007/978-3-031-78255-8_14

2501.14441

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Sam, Dylan, Chakrabarti, Ayan, Rostamizadeh, Afshin, Ramalingam, Srikumar, Citovsky, Gui, Kumar, Sanjiv

Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

arXiv.org Artificial Intelligence, Feb-13-2025

Similarity between training examples is used to curate pretraining datasets for language models by many methods -- for diversification and to select examples similar to high-quality data. However, similarity is typically measured with off-the-shelf embedding models that are generic or trained for tasks such as retrieval. This paper introduces a framework to analyze the suitability of embedding models specifically for data curation in the language model pretraining setting. We quantify the correlation between similarity in the embedding space and similarity in pretraining loss across different training examples, and how diversifying in the embedding space affects pretraining quality. We analyze a variety of embedding models in our framework, with experiments using the Pile dataset for pretraining a 1.7B parameter decoder-only language model. We find that the embedding models we consider are all useful for pretraining data curation. Moreover, a simple approach of averaging per-token embeddings proves to be surprisingly competitive with more sophisticated embedding models -- likely because the latter are not designed specifically for pretraining data curation. Indeed, we believe our analysis and evaluation framework can serve as a foundation for the design of embedding models that specifically reason about similarity in pretraining datasets.
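The "averaging per-token embeddings" baseline mentioned in the abstract can be illustrated with a minimal sketch. The toy 2-D vectors and the function names `mean_pool` and `cosine_similarity` are assumptions made for illustration; the abstract does not specify the paper's embedding source or similarity measure.

```python
import math

def mean_pool(token_embeddings):
    """Average a list of equal-length per-token vectors into one vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy per-token embeddings (made-up values, not from any real model).
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # two tokens
doc_b = [[2.0, 2.0]]               # one token
doc_c = [[1.0, -1.0]]              # one token

sim_ab = cosine_similarity(mean_pool(doc_a), mean_pool(doc_b))
sim_ac = cosine_similarity(mean_pool(doc_a), mean_pool(doc_c))
```

Here `doc_a` pools to a vector pointing in the same direction as `doc_b`, so `sim_ab` is close to 1, while `doc_c` is orthogonal to it, so `sim_ac` is close to 0.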

artificial intelligence, machine learning, natural language, (14 more...)

2502.02494

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

FastEx: Hash Clustering with Exponential Families

Neural Information Processing Systems, Feb-12-2025, 02:24:28 GMT

Clustering is a key component in any data analysis toolbox.

data mining, machine learning, natural language, (19 more...)

Country: North America > United States > California (0.28)

Technology:

Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.69)
(2 more...)

Croella, Anna Livia, Piccialli, Veronica, Sudoso, Antonio M.

Strong bounds for large-scale Minimum Sum-of-Squares Clustering

Clustering is a fundamental technique in data analysis and machine learning, used to group similar data points together. Among various clustering methods, the Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used. MSSC aims to minimize the total squared Euclidean distance between data points and their corresponding cluster centroids. Due to the unsupervised nature of clustering, achieving global optimality is crucial, yet computationally challenging. The complexity of finding the global solution increases exponentially with the number of data points, making exact methods impractical for large-scale datasets. Even obtaining strong lower bounds on the optimal MSSC objective value is computationally prohibitive, making it difficult to assess the quality of heuristic solutions. We address this challenge by introducing a novel method to validate heuristic MSSC solutions through optimality gaps. Our approach employs a divide-and-conquer strategy, decomposing the problem into smaller instances that can be handled by an exact solver. The decomposition is guided by an auxiliary optimization problem, the "anticlustering problem", for which we design an efficient heuristic. Computational experiments demonstrate the effectiveness of the method for large-scale instances, achieving optimality gaps below 3% in most cases while maintaining reasonable computational times. These results highlight the practicality of our approach in assessing feasible clustering solutions for large datasets, bridging a critical gap in MSSC evaluation.
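As a rough illustration of the quantities the abstract refers to, the sketch below computes the MSSC objective for a given assignment and a relative optimality gap between a heuristic value (upper bound) and a lower bound. The gap formula `(UB - LB) / UB` is a common convention and an assumption here; the abstract does not state the paper's exact definition, and the lower-bound value in the example is made up.

```python
def mssc_objective(points, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster centroid.

    Assumes every cluster index in range(k) has at least one point.
    """
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    centroids = [[s / counts[c] for s in sums[c]] for c in range(k)]
    return sum(
        sum((p[d] - centroids[c][d]) ** 2 for d in range(dim))
        for p, c in zip(points, labels)
    )

def optimality_gap(upper_bound, lower_bound):
    """Relative gap between a feasible (heuristic) value and a lower bound."""
    return (upper_bound - lower_bound) / upper_bound

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
ub = mssc_objective(points, labels, k=2)       # value of the heuristic solution
gap = optimality_gap(ub, lower_bound=3.9)      # 3.9 is a hypothetical bound
```

In this toy instance each point lies at squared distance 1 from its centroid, so the objective is 4, and a hypothetical lower bound of 3.9 gives a gap of about 2.5%.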

algorithm, artificial intelligence, machine learning, (18 more...)

2502.08397

Country:

Europe > Italy (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Liu, Weisi, Han, Guangzeng, Huang, Xiaolei

Examining and Adapting Time for Multilingual Classification via Mixture of Temporal Experts

Time is implicitly embedded in the classification process: classifiers are usually built on existing data but applied to future data whose distributions (e.g., label and token) may change. However, existing state-of-the-art classification models rarely consider temporal variations and primarily focus on English corpora, which leaves temporal studies underexplored, especially in multilingual settings. In this study, we fill the gap by treating time as domains (e.g., 2024 vs. 2025), examining temporal effects, and developing a domain adaptation framework to generalize classifiers over time on multiple languages. Our framework proposes Mixture of Temporal Experts (MoTE) to leverage both semantic and data distributional shifts to learn and adapt temporal trends into classification models. Our analysis shows classification performance varies over time across different languages, and we experimentally demonstrate that MoTE can enhance classifier generalizability over temporal data shifts. Our study provides analytic insights and addresses the need for time-aware models that perform robustly in multilingual scenarios.

classification, large language model, machine learning, (20 more...)

2502.08825

Country:

North America > United States > Tennessee > Shelby County > Memphis (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
Europe > Switzerland (0.04)
(7 more...)

Genre: Research Report > New Finding (0.88)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

arXiv.org Machine Learning, Feb-12-2025

k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering

Diaz-Rodriguez, Jairo

We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids, thereby capturing contextual and semantic nuances often lost when relying on purely numerical means of document embeddings. This modification preserves the properties of k-means while offering greater interpretability: the cluster centroid is represented by an LLM-generated summary, whose embedding guides cluster assignments. We also propose a mini-batch variant, enabling efficient online clustering for streaming text data and providing real-time interpretability of evolving cluster centroids. Through extensive simulations, we show that our methods outperform vanilla k-means on multiple metrics while incurring only modest LLM usage that does not scale with dataset size. Finally, we present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams. As part of our evaluation, we compile a new dataset from StackExchange, offering a benchmark for text-stream clustering.
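The idea of using a summary's embedding as the centroid can be sketched in a highly simplified form. Everything below is an assumption made for illustration: `embed` is a toy word-count embedding, `summarize` is a hypothetical stand-in for an LLM call, and the loop replaces the centroid with a summary embedding at every iteration, which may differ from the schedule and prompts used in the paper.

```python
from collections import Counter

VOCAB = ["cat", "dog", "stock", "market"]

def embed(text):
    """Toy embedding: word counts over a tiny fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def summarize(texts):
    """Hypothetical stand-in for an LLM: the two most common vocabulary words."""
    counts = Counter(w for t in texts for w in t.lower().split() if w in VOCAB)
    return " ".join(w for w, _ in counts.most_common(2))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_summary_means(texts, init_summaries, iterations=3):
    """Alternate between nearest-centroid assignment and summary-based updates."""
    summaries = list(init_summaries)
    labels = []
    for _ in range(iterations):
        # Each centroid is the embedding of the cluster's current summary.
        centroids = [embed(s) for s in summaries]
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(embed(t), centroids[c]))
            for t in texts
        ]
        for c in range(len(summaries)):
            members = [t for t, lab in zip(texts, labels) if lab == c]
            if members:  # keep the old summary if a cluster is empty
                summaries[c] = summarize(members)
    return labels, summaries

texts = ["cat dog", "dog cat cat", "stock market", "market stock stock"]
labels, summaries = k_summary_means(texts, init_summaries=["cat", "stock"])
```

The returned `summaries` are human-readable descriptions of each cluster, which is the interpretability benefit the abstract describes.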

centroid, large language model, machine learning, (17 more...)

2502.09667

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > China > Hong Kong (0.04)
North America > United States > Minnesota (0.04)
(4 more...)

Genre:

Overview (0.68)
Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Nazari, Ali, Weiss, Michael

Fine-Tuning Topics through Weighting Aspect Keywords

Topic modeling often requires examining topics from multiple perspectives to uncover hidden patterns, especially in less explored areas. This paper presents an approach to address this need, utilizing weighted keywords from various aspects derived from domain knowledge. The research method starts with standard topic modeling. Then, it adds a process consisting of four key steps. First, it defines keywords for each aspect. Second, it gives weights to these keywords based on their relevance. Third, it calculates relevance scores for aspect-weighted keywords and topic keywords to create aspect-topic models. Fourth, it uses these scores to tune relevant new documents. Finally, the generated topic models are interpreted and validated. The findings show that top-scoring documents are more likely to be about the same aspect of a topic. This highlights the model's effectiveness in finding documents related to the aspects.
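The keyword-weighting and scoring steps can be pictured with a small sketch. The scoring rule (summing the weights of aspect keywords that appear in a document), the example keywords, and the names `aspect_score` and `rank_documents` are all assumptions for illustration; the abstract does not give the paper's actual relevance formula.

```python
def aspect_score(document, keyword_weights):
    """Sum the weights of aspect keywords occurring in the document (assumed rule)."""
    tokens = document.lower().split()
    return sum(keyword_weights.get(tok, 0.0) for tok in tokens)

def rank_documents(documents, keyword_weights):
    """Order documents from most to least relevant to the aspect."""
    return sorted(
        documents,
        key=lambda d: aspect_score(d, keyword_weights),
        reverse=True,
    )

# Hypothetical aspect with hand-assigned keyword weights.
privacy_aspect = {"encryption": 1.0, "breach": 0.5}
docs = [
    "weather is nice",
    "a data breach occurred",
    "encryption protects data",
]
ranked = rank_documents(docs, privacy_aspect)
```

Documents mentioning higher-weighted keywords rank first, mirroring the abstract's observation that top-scoring documents tend to share an aspect.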

information retrieval, machine learning, natural language, (23 more...)

2502.08496

Country:

North America > Canada > Ontario > National Capital Region > Ottawa (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.87)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
(7 more...)

Chakravorty, Amitabh, Elsayed, Nelly

A Comparative Study of Machine Learning Algorithms for Stock Price Prediction Using Insider Trading Data

This paper empirically investigates several machine learning algorithms for forecasting stock prices based on insider trading information. Insider trading offers special insights into market sentiment, pointing to upcoming changes in stock prices. This study examines the effectiveness of algorithms like decision trees, random forests, support vector machines (SVM) with different kernels, and K-Means Clustering using a dataset of Tesla stock transactions. Examining past data from April 2020 to March 2023, this study focuses on how well these algorithms identify trends and forecast stock price fluctuations. The paper uses Recursive Feature Elimination (RFE) and feature importance analysis to optimize the feature set and, hence, increase prediction accuracy. While it requires substantially greater processing time than other models, SVM with the Radial Basis Function (RBF) kernel displays the best accuracy. This paper highlights the trade-offs between accuracy and efficiency in machine learning models and proposes the possibility of pooling multiple data sources to raise prediction performance. The results of this paper aim to help financial analysts and investors in choosing strong algorithms to optimize investment strategies.

algorithm, artificial intelligence, machine learning, (15 more...)

2502.08728

Country:

North America > United States > Ohio > Hamilton County > Cincinnati (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
North America > Mexico (0.04)
Asia > India > Maharashtra > Mumbai (0.04)

Genre: Research Report > New Finding (0.69)

Industry:

Banking & Finance > Trading (1.00)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.57)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.36)

Zheng, Fei, Duchateau, Nicolas

Copula-based mixture model identification for subgroup clustering with imaging applications

Model-based clustering techniques have been widely applied to various application areas, although most studies focus on canonical mixtures with a single component distribution form. However, this strict assumption is often hard to satisfy. In this paper, we consider the more flexible Copula-Based Mixture Models (CBMMs) for clustering, which allow heterogeneous component distributions composed of flexible choices of marginal and copula forms. More specifically, we propose an adaptation of the Generalized Iterative Conditional Estimation (GICE) algorithm to identify the CBMMs in an unsupervised manner, where the marginal and copula forms and their parameters are estimated iteratively. GICE is adapted from its original version developed for switching Markov model identification with the choice of realization time. Our CBMM-GICE clustering method is then tested on synthetic two-cluster data (N=2000 samples) with discussion of the factors impacting its convergence. Finally, it is compared to mixture models with a single component form identified by Expectation Maximization, on the entire MNIST database (N=70000) and on real cardiac magnetic resonance data (N=276), to illustrate its value for imaging applications.

artificial intelligence, copula, machine learning, (18 more...)

2502.08549

Country:

North America > Canada > British Columbia (0.04)
North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.04)
Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.66)

Hoang, Cuong Manh, Lee, Yeejin, Kang, Byeongkeun

Generalized Class Discovery in Instance Segmentation

This work addresses the task of generalized class discovery (GCD) in instance segmentation. The goal is to discover novel classes and obtain a model capable of segmenting instances of both known and novel categories, given labeled and unlabeled data. Since the real world contains numerous objects with long-tailed distributions, the instance distribution for each class is inherently imbalanced. To address the imbalanced distributions, we propose an instance-wise temperature assignment (ITA) method for contrastive learning and class-wise reliability criteria for pseudo-labels. The ITA method relaxes instance discrimination for samples belonging to head classes to enhance GCD. The reliability criteria are to avoid excluding most pseudo-labels for tail classes when training an instance segmentation network using pseudo-labels from GCD. Additionally, we propose dynamically adjusting the criteria to leverage diverse samples in the early stages while relying only on reliable pseudo-labels in the later stages. We also introduce an efficient soft attention module to encode object-specific representations for GCD. Finally, we evaluate our proposed method by conducting experiments on two settings: COCO$_{half}$ + LVIS and LVIS + Visual Genome. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods.

artificial intelligence, machine learning, recognition, (15 more...)

2502.08149

Country:

North America > United States > California > Alameda County > Oakland (0.04)
Europe > Switzerland (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)