top2vec
Hybrid Topic-Semantic Labeling and Graph Embeddings for Unsupervised Legal Document Clustering
Bastola, Deepak, Choi, Woohyeok
Legal documents pose unique challenges for text classification due to their domain-specific language and often limited labeled data. This paper proposes a hybrid approach for classifying legal texts by combining unsupervised topic and graph embeddings with a supervised model. We employ Top2Vec to learn semantic document embeddings and automatically discover latent topics, and Node2Vec to capture structural relationships via a bipartite graph of legal documents. The embeddings are combined and clustered using KMeans, yielding coherent groupings of documents. Our computations on a legal document dataset demonstrate that the combined Top2Vec+Node2Vec approach improves clustering quality over text-only or graph-only embeddings. We conduct a sensitivity analysis of hyperparameters, such as the number of clusters and the dimensionality of the embeddings, and demonstrate that our method achieves competitive performance against baseline Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) models. Key findings indicate that while the pipeline presents an innovative approach to unsupervised legal document analysis by combining semantic topic modeling with graph embedding techniques, its efficacy is contingent upon the quality of initial topic generation and the representational power of the chosen embedding models for specialized legal language. Strategic recommendations include the exploration of domain-specific embeddings, more comprehensive hyperparameter tuning for Node2Vec, dynamic determination of cluster numbers, and robust human-in-the-loop validation processes to enhance legal relevance and trustworthiness. The pipeline demonstrates potential for exploratory legal data analysis and as a precursor to supervised learning tasks but requires further refinement and domain-specific adaptation for practical legal applications.
- North America > United States > California (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > India (0.04)
- Research Report (1.00)
- Overview (0.88)
BERTopic for Topic Modeling of Hindi Short Texts: A Comparative Study
Mutsaddi, Atharva, Jamkhande, Anvi, Thakre, Aryan, Haribhakta, Yashodhara
As short text data in native languages like Hindi increasingly appear in modern media, robust methods for topic modeling on such data have gained importance. This study investigates the performance of BERTopic in modeling Hindi short texts, an area that has been under-explored in existing research. Using contextual embeddings, BERTopic can capture semantic relationships in data, making it potentially more effective than traditional models, especially for short and diverse texts. We evaluate BERTopic using 6 different document embedding models and compare its performance against 8 established topic modeling techniques, such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Indexing (LSI), Additive Regularization of Topic Models (ARTM), Probabilistic Latent Semantic Analysis (PLSA), Embedded Topic Model (ETM), Combined Topic Model (CTM), and Top2Vec. The models are assessed using coherence scores across a range of topic counts. Our results reveal that BERTopic consistently outperforms other models in capturing coherent topics from short Hindi texts.
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Switzerland (0.04)
- Asia > Middle East > Jordan (0.04)
- (4 more...)
Insights Into the Nutritional Prevention of Macular Degeneration based on a Comparative Topic Modeling Approach
Topic modeling and text mining are subsets of Natural Language Processing (NLP) with relevance for conducting meta-analysis (MA) and systematic review (SR). For evidence synthesis, the above NLP methods are conventionally used for topic-specific literature searches or extracting values from reports to automate essential phases of SR and MA. Instead, this work proposes a comparative topic modeling approach to analyze reports of contradictory results on the same general research question. Specifically, the objective is to identify topics exhibiting distinct associations with significant results for an outcome of interest by ranking them according to their proportional occurrence in (and consistency of distribution across) reports of significant effects. The proposed method was tested on broad-scope studies addressing whether supplemental nutritional compounds significantly benefit macular degeneration (MD). Four of these were further supported in terms of effectiveness upon conducting a follow-up literature search for validation (omega-3 fatty acids, copper, zeaxanthin, and nitrates). The two not supported by the follow-up literature search (niacin and molybdenum) also had scores in the lowest range under the proposed scoring system, suggesting that the proposed methods score for a given topic may be a viable proxy for its degree of association with the outcome of interest and can be helpful in the search for potentially causal relationships. These results underpin the proposed methods potential to add specificity in understanding effects from broad-scope reports, elucidate topics of interest for future research, and guide evidence synthesis in a systematic and scalable way. All of this is accomplished while yielding valuable insights into the prevention of MD.
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > Canada (0.14)
- Europe > Austria > Styria > Graz (0.04)
- (3 more...)
- Research Report > Strength High (1.00)
- Research Report > Experimental Study > Negative Result (0.34)
- Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
- Health & Medicine > Consumer Health (1.00)
- Education > Health & Safety > School Nutrition (1.00)
Exploring the Power of Topic Modeling Techniques in Analyzing Customer Reviews: A Comparative Analysis
The exponential growth of online social network platforms and applications has led to a staggering volume of user-generated textual content, including comments and reviews. Consequently, users often face difficulties in extracting valuable insights or relevant information from such content. To address this challenge, machine learning and natural language processing algorithms have been deployed to analyze the vast amount of textual data available online. In recent years, topic modeling techniques have gained significant popularity in this domain. In this study, we comprehensively examine and compare five frequently used topic modeling methods specifically applied to customer reviews. The methods under investigation are latent semantic analysis (LSA), latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), pachinko allocation model (PAM), Top2Vec, and BERTopic. By practically demonstrating their benefits in detecting important topics, we aim to highlight their efficacy in real-world scenarios. To evaluate the performance of these topic modeling methods, we carefully select two textual datasets. The evaluation is based on standard statistical evaluation metrics such as topic coherence score. Our findings reveal that BERTopic consistently yield more meaningful extracted topics and achieve favorable results.
- Asia > Middle East > UAE (0.14)
- Asia > India (0.04)
- North America > United States > New York (0.04)
- (5 more...)
- Information Technology > Services (0.48)
- Health & Medicine > Therapeutic Area > Immunology (0.47)
- Banking & Finance > Trading (0.46)
Hong Kong Machine Learning Season 4 Episode 4
We are looking to organize online x in-person meetups on HK island going forward. Thanks to our sponsor Darwinex to help us supporting the various costs. Abstract: We introduce a class of interpretable tree-based models (P-Trees) for analyzing panel data, with iterative and global (instead of recursive and local) splitting criteria to avoid overfitting and improve model performance. We apply P-Tree to generate a stochastic discount factor model and test assets for cross-sectional asset pricing. Unlike other tree algorithms, P-Trees accommodate imbalanced panels of asset returns and grow under the no-arbitrage condition.
- Asia > China > Hong Kong (0.43)
- Europe > United Kingdom (0.16)
- Media > Television (0.85)
- Leisure & Entertainment (0.85)
- Banking & Finance (0.51)
How to perform topic modeling with Top2Vec
Topic modeling is a problem in natural language processing that has many real-world applications. Being able to discover topics within large sections of text helps us understand text data in greater detail. For many years, Latent Dirichlet Allocation (LDA) has been the most commonly used algorithm for topic modeling. The algorithm was first introduced in 2003 and treats topics as probability distributions for the occurrence of different words. If you want to see an example of LDA in action, you should check out my article below where I performed LDA on a fake news classification dataset.