AITopics | top2vec

top2vec

Hybrid Topic-Semantic Labeling and Graph Embeddings for Unsupervised Legal Document Clustering

Bastola, Deepak, Choi, Woohyeok

arXiv.org Machine Learning, Sep-3-2025

Legal documents pose unique challenges for text classification due to their domain-specific language and often limited labeled data. This paper proposes a hybrid approach for classifying legal texts by combining unsupervised topic and graph embeddings with a supervised model. We employ Top2Vec to learn semantic document embeddings and automatically discover latent topics, and Node2Vec to capture structural relationships via a bipartite graph of legal documents. The embeddings are combined and clustered using KMeans, yielding coherent groupings of documents. Our computations on a legal document dataset demonstrate that the combined Top2Vec+Node2Vec approach improves clustering quality over text-only or graph-only embeddings. We conduct a sensitivity analysis of hyperparameters, such as the number of clusters and the dimensionality of the embeddings, and demonstrate that our method achieves competitive performance against baseline Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) models. Key findings indicate that while the pipeline presents an innovative approach to unsupervised legal document analysis by combining semantic topic modeling with graph embedding techniques, its efficacy is contingent upon the quality of initial topic generation and the representational power of the chosen embedding models for specialized legal language. Strategic recommendations include the exploration of domain-specific embeddings, more comprehensive hyperparameter tuning for Node2Vec, dynamic determination of cluster numbers, and robust human-in-the-loop validation processes to enhance legal relevance and trustworthiness. The pipeline demonstrates potential for exploratory legal data analysis and as a precursor to supervised learning tasks but requires further refinement and domain-specific adaptation for practical legal applications.
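
The combine-then-cluster step described in this abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not say how the embeddings are combined, so concatenation after L2 normalization is an assumption, the toy vectors are hypothetical stand-ins for Top2Vec document vectors and Node2Vec node vectors, and the small `kmeans` helper is a plain Lloyd's algorithm written here for self-containment.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so neither embedding dominates the other."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def combine(text_vecs, graph_vecs):
    """Concatenate per-document text and graph embeddings (assumed combination rule)."""
    return [l2_normalize(t) + l2_normalize(g) for t, g in zip(text_vecs, graph_vecs)]

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy data: four documents, 2-d text vectors and 2-d graph vectors.
text_vecs = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
graph_vecs = [[0.8, 0.0], [0.0, 0.7], [1.0, 0.1], [0.1, 1.0]]
labels = kmeans(combine(text_vecs, graph_vecs), k=2)
print(labels)
```

Normalizing each embedding before concatenation keeps one representation from dominating the distance computation purely because of its scale; a weighted concatenation would be another reasonable choice.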

artificial intelligence, machine learning, natural language, (16 more...)

2509.0099

Country:

North America > United States > California (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Asia > Middle East > Jordan (0.04)
Asia > India (0.04)

Genre:

Research Report (1.00)
Overview (0.88)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

BERTopic for Topic Modeling of Hindi Short Texts: A Comparative Study

Mutsaddi, Atharva, Jamkhande, Anvi, Thakre, Aryan, Haribhakta, Yashodhara

arXiv.org Artificial Intelligence, Jan-7-2025

As short text data in native languages like Hindi increasingly appear in modern media, robust methods for topic modeling on such data have gained importance. This study investigates the performance of BERTopic in modeling Hindi short texts, an area that has been under-explored in existing research. Using contextual embeddings, BERTopic can capture semantic relationships in data, making it potentially more effective than traditional models, especially for short and diverse texts. We evaluate BERTopic using 6 different document embedding models and compare its performance against 8 established topic modeling techniques, such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Indexing (LSI), Additive Regularization of Topic Models (ARTM), Probabilistic Latent Semantic Analysis (PLSA), Embedded Topic Model (ETM), Combined Topic Model (CTM), and Top2Vec. The models are assessed using coherence scores across a range of topic counts. Our results reveal that BERTopic consistently outperforms other models in capturing coherent topics from short Hindi texts.
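
A coherence score of the kind mentioned in this abstract can be illustrated with a small sketch. The abstract does not state which coherence measure the study uses, so NPMI over document-level co-occurrence is an assumption here, and the function name `npmi_coherence` and the tiny English corpus are hypothetical placeholders for tokenized Hindi short texts.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs in a topic, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1:
            scores.append(0.0)   # NPMI is undefined here; this sketch returns 0.0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

# Hypothetical tokenized corpus of four short texts.
documents = [
    ["cricket", "match", "team"],
    ["cricket", "team", "score"],
    ["election", "vote", "party"],
    ["election", "party", "minister"],
]
coherent_score = npmi_coherence(["cricket", "team"], documents)
mixed_score = npmi_coherence(["cricket", "election"], documents)
print(coherent_score, mixed_score)
```

Words that always appear together score high, while words that never share a document score at the minimum, which is why a topic mixing unrelated words is penalized.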

artificial intelligence, machine learning, natural language, (19 more...)

2501.03843

Country: North America > United States (0.47)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Insights Into the Nutritional Prevention of Macular Degeneration based on a Comparative Topic Modeling Approach

Jacaruso, Lucas Cassiel

arXiv.org Artificial Intelligence, Nov-17-2023

Topic modeling and text mining are subsets of Natural Language Processing (NLP) with relevance for conducting meta-analysis (MA) and systematic review (SR). For evidence synthesis, these NLP methods are conventionally used for topic-specific literature searches or for extracting values from reports to automate essential phases of SR and MA. Instead, this work proposes a comparative topic modeling approach to analyze reports of contradictory results on the same general research question. Specifically, the objective is to identify topics exhibiting distinct associations with significant results for an outcome of interest by ranking them according to their proportional occurrence in (and consistency of distribution across) reports of significant effects. The proposed method was tested on broad-scope studies addressing whether supplemental nutritional compounds significantly benefit macular degeneration (MD). Of the compounds the method flagged, four were further supported in terms of effectiveness by a follow-up literature search for validation (omega-3 fatty acids, copper, zeaxanthin, and nitrates). The two not supported by the follow-up literature search (niacin and molybdenum) also had scores in the lowest range under the proposed scoring system, suggesting that the proposed method's score for a given topic may be a viable proxy for its degree of association with the outcome of interest and can be helpful in the search for potentially causal relationships. These results underpin the proposed method's potential to add specificity in understanding effects from broad-scope reports, elucidate topics of interest for future research, and guide evidence synthesis in a systematic and scalable way. All of this is accomplished while yielding valuable insights into the prevention of MD.

artificial intelligence, data mining, natural language, (16 more...)

2309.00312

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > Canada (0.14)
Europe > Austria > Styria > Graz (0.04)
(3 more...)

Genre:

Research Report > Strength High (1.00)
Research Report > Experimental Study > Negative Result (0.34)

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Consumer Health (1.00)
Education > Health & Safety > School Nutrition (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Data Science > Data Mining > Text Mining (0.35)

Exploring the Power of Topic Modeling Techniques in Analyzing Customer Reviews: A Comparative Analysis

Krishnan, Anusuya

arXiv.org Artificial Intelligence, Aug-19-2023

The exponential growth of online social network platforms and applications has led to a staggering volume of user-generated textual content, including comments and reviews. Consequently, users often face difficulties in extracting valuable insights or relevant information from such content. To address this challenge, machine learning and natural language processing algorithms have been deployed to analyze the vast amount of textual data available online. In recent years, topic modeling techniques have gained significant popularity in this domain. In this study, we comprehensively examine and compare six frequently used topic modeling methods specifically applied to customer reviews. The methods under investigation are latent semantic analysis (LSA), latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), pachinko allocation model (PAM), Top2Vec, and BERTopic. By practically demonstrating their benefits in detecting important topics, we aim to highlight their efficacy in real-world scenarios. To evaluate the performance of these topic modeling methods, we carefully select two textual datasets. The evaluation is based on standard statistical evaluation metrics such as the topic coherence score. Our findings reveal that BERTopic consistently yields more meaningful extracted topics and achieves favorable results.

machine learning, natural language, topic modeling, (18 more...)

2308.1152

Country:

Asia > Middle East > UAE (0.14)
Asia > India (0.04)
North America > United States > New York (0.04)
(5 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Information Technology > Services (0.48)
Health & Medicine > Therapeutic Area > Immunology (0.47)
Banking & Finance > Trading (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.91)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Hong Kong Machine Learning Season 4 Episode 4

#artificialintelligence, Dec-31-2021, 10:00:26 GMT

We are looking to organize combined online and in-person meetups on HK island going forward. Thanks to our sponsor Darwinex for helping us cover the various costs. Abstract: We introduce a class of interpretable tree-based models (P-Trees) for analyzing panel data, with iterative and global (instead of recursive and local) splitting criteria to avoid overfitting and improve model performance. We apply P-Tree to generate a stochastic discount factor model and test assets for cross-sectional asset pricing. Unlike other tree algorithms, P-Trees accommodate imbalanced panels of asset returns and grow under the no-arbitrage condition.

application, information, kong machine learning season 4, (12 more...)

Country:

Asia > China > Hong Kong (0.43)
Europe > United Kingdom (0.16)

Industry:

Media > Television (0.85)
Leisure & Entertainment (0.85)
Banking & Finance (0.51)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

How to perform topic modeling with Top2Vec

#artificialintelligence, Nov-17-2021, 16:26:55 GMT

Topic modeling is a problem in natural language processing that has many real-world applications. Being able to discover topics within large sections of text helps us understand text data in greater detail. For many years, Latent Dirichlet Allocation (LDA) has been the most commonly used algorithm for topic modeling. The algorithm was first introduced in 2003 and treats topics as probability distributions for the occurrence of different words. If you want to see an example of LDA in action, you should check out my article below where I performed LDA on a fake news classification dataset.
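
The idea that LDA treats topics as probability distributions over words can be made concrete with a toy sketch. All numbers and names below are hypothetical and hand-picked for illustration; nothing here is fitted by LDA. Under the LDA view, the probability of a word in a document is the topic-weighted sum of that word's probability under each topic.

```python
# Hypothetical toy values: each topic is a probability distribution over the vocabulary.
topics = {
    "politics": {"election": 0.5, "vote": 0.3, "game": 0.1, "team": 0.1},
    "sports": {"election": 0.05, "vote": 0.05, "game": 0.5, "team": 0.4},
}

# Hypothetical topic mixture for one document: mostly sports, some politics.
doc_topic_mix = {"politics": 0.25, "sports": 0.75}

def word_probability(word, topics, doc_topic_mix):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in doc_topic_mix.items()
    )

p_game = word_probability("game", topics, doc_topic_mix)
vocabulary = sorted(topics["politics"])
total = sum(word_probability(w, topics, doc_topic_mix) for w in vocabulary)
print(p_game, total)
```

Because each topic and the document mixture both sum to one, the resulting word probabilities for the document also sum to one across the vocabulary.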

algorithm, top2vec, vector, (16 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)
