A Appendix

A.1 List of Neural Topic Modeling Works Used in Our Meta-Analysis

Corpus statistics are in Table 7.

Document processing:
- We do not process documents with fewer than 25 whitespace-separated tokens. Following processing (e.g., stopword removal), we remove documents with fewer than … tokens.
- The vocabulary is created from the training data. Stop-words are retained if they are contained within detected noun entities (e.g., "The United States of America" → united_states_of_america).
- We filter out tokens with two or fewer characters.
- Standard rules-of-thumb for vocabulary pruning, like removing terms that appear in fewer than 0.5% of documents, … To keep vocabulary sizes roughly consistent across datasets, we set the minimum document-frequency for terms as a (power) function of the total corpus size.
- We use gensim (Řehůřek and Sojka, 2010) as a Python wrapper for running Mallet.
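The power-function rule for the minimum document frequency can be sketched as follows. The scale and exponent constants here are illustrative assumptions, not the values used in the paper:

```python
def min_doc_frequency(num_docs, scale=0.05, exponent=0.5):
    """Minimum document frequency for a term to enter the vocabulary,
    grown as a power function of corpus size (illustrative constants)."""
    return max(2, int(scale * num_docs ** exponent))

# Larger corpora get a higher absolute threshold but a lower *relative* one,
# which keeps vocabulary sizes roughly comparable across datasets.
for n in (1_000, 10_000, 100_000):
    print(n, min_doc_frequency(n))
```

Because the threshold grows sublinearly, a fixed-percentage rule (e.g., 0.5% of documents) would prune far more aggressively on large corpora than this one does.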
Semantic Analysis of SNOMED CT Concept Co-occurrences in Clinical Documentation using MIMIC-IV
Noori, Ali, Mohanty, Somya, Manda, Prashanti
Clinical notes contain rich clinical narratives but their unstructured format poses challenges for large-scale analysis. Standardized terminologies such as SNOMED CT improve interoperability, yet understanding how concepts relate through co-occurrence and semantic similarity remains underexplored. In this study, we leverage the MIMIC-IV database to investigate the relationship between SNOMED CT concept co-occurrence patterns and embedding-based semantic similarity. Using Normalized Pointwise Mutual Information (NPMI) and pretrained embeddings (e.g., ClinicalBERT, BioBERT), we examine whether frequently co-occurring concepts are also semantically close, whether embeddings can suggest missing concepts, and how these relationships evolve temporally and across specialties. Our analyses reveal that while co-occurrence and semantic similarity are weakly correlated, embeddings capture clinically meaningful associations not always reflected in documentation frequency. Embedding-based suggestions frequently matched concepts later documented, supporting their utility for augmenting clinical annotations. Clustering of concept embeddings yielded coherent clinical themes (symptoms, labs, diagnoses, cardiovascular conditions) that map to patient phenotypes and care patterns. Finally, co-occurrence patterns linked to outcomes such as mortality and readmission demonstrate the practical utility of this approach. Collectively, our findings highlight the complementary value of co-occurrence statistics and semantic embeddings in improving documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications.
- North America > United States > North Carolina > Guilford County > Greensboro (0.14)
- North America > United States > Nebraska > Douglas County > Omaha (0.14)
- Asia > Middle East > Israel (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.69)
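The NPMI measure used in the study above can be sketched directly from document co-occurrence counts. The counts in the example below are toy values for illustration, not MIMIC-IV statistics:

```python
import math

def npmi(count_xy, count_x, count_y, total):
    """Normalized pointwise mutual information in [-1, 1].
    count_xy: documents containing both concepts; count_x / count_y:
    documents containing each concept; total: number of documents."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)  # normalize by -log p(x, y)

# Concepts that always co-occur score 1.0; independent concepts score ~0.
print(npmi(50, 50, 50, 1000))    # perfect co-occurrence
print(npmi(10, 100, 100, 1000))  # p(x, y) == p(x) * p(y)
```

Note that NPMI is undefined when a pair never co-occurs (`count_xy == 0`); in practice such pairs are either skipped or smoothed.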
Evaluating Negative Sampling Approaches for Neural Topic Models
Adhya, Suman, Lahiri, Avishek, Sanyal, Debarshi Kumar, Das, Partha Pratim
Negative sampling has emerged as an effective technique that enables deep learning models to learn better representations by introducing the learn-to-compare paradigm: the model gains robustness by contrasting positive samples against negative ones. Despite its numerous demonstrations in various areas of computer vision and natural language processing, the effect of negative sampling in an unsupervised domain like topic modeling has not been comprehensively studied. In this paper, we present a comprehensive analysis of the impact of different negative sampling strategies on neural topic models. We compare the performance of several popular neural topic models by incorporating a negative sampling technique in the decoder of variational autoencoder-based neural topic models. Experiments on four publicly available datasets demonstrate that integrating negative sampling into topic models results in significant enhancements across multiple aspects, including improved topic coherence, richer topic diversity, and more accurate document classification. Manual evaluations also indicate that the inclusion of negative sampling into neural topic models enhances the quality of the generated topics. These findings highlight the potential of negative sampling as a valuable tool for advancing the effectiveness of neural topic models.
- Asia > India > West Bengal > Kolkata (0.14)
- Asia > India > West Bengal > Kharagpur (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (2 more...)
- Transportation (0.93)
- Leisure & Entertainment (0.93)
- Media > Film (0.46)
Words as Gatekeepers: Measuring Discipline-specific Terms and Meanings in Scholarly Publications
Lucy, Li, Dodge, Jesse, Bamman, David, Keith, Katherine A.
Scholarly text is often laden with jargon, or specialized language that can facilitate efficient in-group communication within fields but hinder understanding for out-groups. In this work, we develop and validate an interpretable approach for measuring scholarly jargon from text. Expanding the scope of prior work which focuses on word types, we use word sense induction to also identify words that are widespread but overloaded with different meanings across fields. We then estimate the prevalence of these discipline-specific words and senses across hundreds of subfields, and show that word senses provide a complementary, yet unique view of jargon alongside word types. We demonstrate the utility of our metrics for science of science and computational sociolinguistics by highlighting two key social implications. First, most fields reduce their use of jargon when writing for general-purpose venues, though some fields (e.g., biological sciences) do so less than others. Second, the direction of correlation between jargon and citation rates varies among fields, but jargon is nearly always negatively correlated with interdisciplinary impact. Broadly, our findings suggest that though multidisciplinary venues intend to cater to more general audiences, some fields' writing norms may act as barriers rather than bridges, and thus impede the dispersion of scholarly ideas.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- (12 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
- Energy (0.93)
- (2 more...)
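One minimal way to sketch a discipline-specificity score of the kind measured above is a smoothed frequency ratio: the fraction of a term's occurrences that fall inside a single field. This is our own illustrative metric with invented counts, not the paper's exact formulation:

```python
def jargon_score(term, field_counts, field, smoothing=1.0):
    """Fraction of a term's (smoothed) occurrences that fall in one field.
    field_counts: {field: {term: count}}. A score near 1.0 means the term
    is used almost exclusively inside that field (illustrative metric)."""
    in_field = field_counts[field].get(term, 0) + smoothing
    total = sum(counts.get(term, 0) + smoothing
                for counts in field_counts.values())
    return in_field / total

# Toy counts: "phenotype" is concentrated in biology; "model" is shared.
counts = {
    "biology": {"phenotype": 90, "model": 40},
    "cs":      {"phenotype": 2,  "model": 60},
}
print(jargon_score("phenotype", counts, "biology"))
print(jargon_score("model", counts, "biology"))
```

The smoothing term keeps the score well-defined for terms unseen in a field; note this type-level score cannot separate the sense-level jargon the paper also measures.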
Improving Contextualized Topic Models with Negative Sampling
Adhya, Suman, Lahiri, Avishek, Sanyal, Debarshi Kumar, Das, Partha Pratim
Topic modeling has emerged as a dominant method for exploring large document collections. Recent approaches to topic modeling use large contextualized language models and variational autoencoders. In this paper, we propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics. In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document and dissimilar to the document reconstructed from the perturbed vector. Experiments for different topic counts on three publicly available benchmark datasets show that in most cases, our approach leads to an increase in topic coherence over that of the baselines. Our model also achieves very high topic diversity.
- Asia > Middle East > Jordan (0.04)
- Asia > India > West Bengal > Kolkata (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (5 more...)
- Leisure & Entertainment (0.46)
- Banking & Finance (0.46)
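The triplet objective described above can be sketched as follows, assuming a linear topic-to-word decoder. The perturbation used here (shuffling the document-topic proportions) and the margin value are illustrative choices, not necessarily the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(theta, beta):
    """Reconstruct a document's word distribution from its topic mixture:
    (K,) topic proportions x (K, V) topic-word matrix -> (V,) word vector."""
    return theta @ beta

def triplet_loss(doc_bow, theta, beta, margin=1.0):
    """Encourage the reconstruction from the true document-topic vector to
    lie closer to the input than the reconstruction from a perturbed one.
    Perturbation = shuffled topic proportions (an illustrative choice)."""
    theta_neg = rng.permutation(theta)
    pos = np.linalg.norm(doc_bow - decode(theta, beta))
    neg = np.linalg.norm(doc_bow - decode(theta_neg, beta))
    return max(0.0, margin + pos - neg)

# Toy setup: a document that is perfectly reconstructable from theta,
# so the positive distance is zero and the loss is max(0, margin - neg).
K, V = 5, 50
beta = rng.random((K, V))
beta /= beta.sum(axis=1, keepdims=True)        # each topic is a word distribution
theta = np.array([0.7, 0.1, 0.1, 0.05, 0.05])  # document-topic proportions
doc = decode(theta, beta)
print(triplet_loss(doc, theta, beta))
```

In the actual model this term would be added to the VAE objective and backpropagated; the sketch only shows the forward computation.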
Topics as Entity Clusters: Entity-based Topics from Language Models and Graph Neural Networks
Loureiro, Manuel V., Derby, Steven, Wijaya, Tri Kurniawan
Topic models aim to reveal the latent structure behind a corpus, typically conducted over a bag-of-words representation of documents. In the context of topic modeling, most vocabulary is either irrelevant for uncovering underlying topics or contains strong relationships with relevant concepts, impacting the interpretability of these topics. Furthermore, their limited expressiveness and dependency on language demand considerable computational resources. Hence, we propose a novel approach for cluster-based topic modeling that employs conceptual entities. Entities are language-agnostic representations of real-world concepts rich in relational information. To this end, we extract vector representations of entities from (i) an encyclopedic corpus using a language model; and (ii) a knowledge base using a graph neural network. We demonstrate that our approach consistently outperforms other state-of-the-art topic models across coherency metrics and find that the explicit knowledge encoded in the graph-based embeddings provides more coherent topics than the implicit knowledge encoded with the contextualized embeddings of language models.
- North America > United States > New York > New York County > New York City (0.05)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Hong Kong (0.04)
- (8 more...)
- Leisure & Entertainment > Sports > Boxing (1.00)
- Information Technology (0.93)
- Automobiles & Trucks (0.68)
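Cluster-based topic modeling over entity embeddings, as described above, can be sketched with plain k-means: cluster the entity vectors, then represent each topic by the entities nearest its centroid. The 2-D toy embeddings below stand in for the language-model and graph-based embeddings the paper actually uses:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means (illustrative stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def topic_top_entities(X, names, labels, centroids, topn=3):
    """Represent each topic by the entities closest to its cluster centroid."""
    topics = []
    for j, c in enumerate(centroids):
        members = [i for i in range(len(X)) if labels[i] == j]
        members.sort(key=lambda i: np.linalg.norm(X[i] - c))
        topics.append([names[i] for i in members[:topn]])
    return topics

# Toy entity embeddings: two well-separated groups.
names = ["aspirin", "ibuprofen", "paris", "berlin"]
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = kmeans(emb, k=2)
print(topic_top_entities(emb, names, labels, centroids, topn=2))
```

Because entities (rather than surface words) are clustered, the same pipeline applies unchanged across languages, which is the language-agnosticity the abstract emphasizes.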
Moving beyond word lists: towards abstractive topic labels for human-like topics of scientific documents
Topic models represent groups of documents as a list of words (the topic labels). This work asks whether an alternative approach to topic labeling can be developed that is closer to a natural language description of a topic than a word list. To this end, we present an approach to generating human-like topic labels using abstractive multi-document summarization (MDS). We investigate our approach with an exploratory case study, modeling topics in citation sentences in order to understand what further research needs to be done to fully operationalize MDS for topic labeling. Our case study shows that, in addition to producing more human-like topics, this approach has a further advantage: topics can be evaluated with clustering and summarization measures instead of topic model measures. However, several developments are needed before a well-powered study of MDS for topic labeling can be designed: improving cluster cohesion, improving the factuality and faithfulness of MDS, and increasing the number of documents that MDS can support. We present a number of ideas on how these can be tackled and conclude with some thoughts on how topic modeling can also be used to improve MDS in general.
- Asia > Middle East > Jordan (0.05)
- North America > United States > New York > Kings County > New York City (0.04)
- North America > Dominican Republic (0.04)
- (2 more...)
Exploring Semantic Capacity of Terms
Huang, Jie, Wang, Zilong, Chang, Kevin Chen-Chuan, Hwu, Wen-mei, Xiong, Jinjun
We introduce and study the semantic capacity of terms. For example, the semantic capacity of artificial intelligence is higher than that of linear regression, since artificial intelligence possesses a broader meaning scope. Understanding the semantic capacity of terms will help many downstream tasks in natural language processing. For this purpose, we propose a two-step model to investigate the semantic capacity of terms, which takes a large text corpus as input and can evaluate the semantic capacity of terms if the corpus provides enough co-occurrence information about them. Extensive experiments in three fields demonstrate the effectiveness and rationality of our model compared with well-designed baselines and human-level evaluations.
- North America > United States > Illinois (0.05)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)