seed word
Interpretable dimensions support an effect of agentivity and telicity on split intransitivity
Neu, Eva, Dillon, Brian, Erk, Katrin
Intransitive verbs fall into two different syntactic classes, unergatives and unaccusatives. It has long been argued that verbs describing an agentive action are more likely to appear in an unergative syntax, and those describing a telic event to appear in an unaccusative syntax. However, recent work by Kim et al. (2024) found that human ratings for agentivity and telicity were a poor predictor of the syntactic behavior of intransitives. Here we revisit this question using interpretable dimensions, computed from seed words on opposite poles of the agentive and telic scales. Our findings support the link between unergativity/unaccusativity and agentivity/telicity, and demonstrate that using interpretable dimensions in conjunction with human judgments can offer valuable evidence for semantic properties that are not easily evaluated in rating tasks.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- Europe > Netherlands > South Holland > Dordrecht (0.04)
- North America > United States > Texas > Tarrant County > Arlington (0.04)
- (4 more...)
Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models
The evaluation of LLMs' creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Association Chains to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Chatbot Arena Creative Writing rankings (Spearman's $ρ= 0.739$, $p < 0.001$) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, professional humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, and humans demonstrating a greater diversity of associative patterns.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Europe > Germany > Saxony > Leipzig (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities
Demers, Elizabeth, Wang, Victor Xiaoqi, Wu, Kean
Human capital (HC) is increasingly important to corporate value creation. Unlike other assets, however, HC is not currently subject to well-defined measurement or disclosure rules. We use a machine learning algorithm (word2vec) trained on a confirmed set of HC disclosures to develop a comprehensive list of HC-related keywords classified into five subcategories (DEI; health and safety; labor relations and culture; compensation and benefits; and demographics and other) that capture the multidimensional nature of HC management. We share our lexicon, corporate HC disclosures, and the Python code used to develop the lexicon, and we provide detailed examples of using our data and code, including for fine-tuning a BERT model. Researchers can use our HC lexicon (or modify the code to capture another construct of interest) with their samples of corporate communications to address pertinent HC questions. We close with a discussion of future research opportunities related to HC management and disclosure.
- Europe > United Kingdom (0.14)
- North America > United States > New York > Monroe County > Rochester (0.05)
- Europe > France > Normandy > Seine-Maritime > Rouen (0.04)
- (4 more...)
- Research Report (1.00)
- Overview (1.00)
- Public Relations > Community Relations (0.46)
- Social Sector (1.00)
- Law > Labor & Employment Law (1.00)
- Law > Civil Rights & Constitutional Law (1.00)
- (13 more...)
Constrained Non-negative Matrix Factorization for Guided Topic Modeling of Minority Topics
Ebrahimi, Seyedeh Fatemeh, Peltonen, Jaakko
Topic models often fail to capture low-prevalence, domain-critical themes, so-called minority topics, such as mental health themes in online comments. While some existing methods can incorporate domain knowledge, such as expected topical content, methods allowing guidance may require overly detailed expected topics, hindering the discovery of topic divisions and variation. We propose a topic modeling solution via a specially constrained NMF. We incorporate a seed word list characterizing minority content of interest, but we do not require experts to pre-specify their division across minority topics. Through prevalence constraints on minority topics and seed word content across topics, we learn distinct data-driven minority topics as well as majority topics. The constrained NMF is fitted via Karush-Kuhn-Tucker (KKT) conditions with multiplicative updates. We outperform several baselines on synthetic data in terms of topic purity, normalized mutual information, and also evaluate topic quality using Jensen-Shannon divergence (JSD). We conduct a case study on YouTube vlog comments, analyzing viewer discussion of mental health content; our model successfully identifies and reveals this domain-relevant minority content.
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > Israel (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (17 more...)
Seeded Poisson Factorization: Leveraging domain knowledge to fit topic models
Prostmaier, Bernd, Vávra, Jan, Grün, Bettina, Hofmarcher, Paul
Topic models are widely used for discovering latent thematic structures in large text corpora, yet traditional unsupervised methods often struggle to align with predefined conceptual domains. This paper introduces Seeded Poisson Factorization (SPF), a novel approach that extends the Poisson Factorization framework by incorporating domain knowledge through seed words. SPF enables a more interpretable and structured topic discovery by modifying the prior distribution of topic-specific term intensities, assigning higher initial rates to predefined seed words. The model is estimated using variational inference with stochastic gradient optimization, ensuring scalability to large datasets. We apply SPF to an Amazon customer feedback dataset, leveraging predefined product categories as guiding structures. Our evaluation demonstrates that SPF achieves superior classification performance compared to alternative guided topic models, particularly in terms of computational efficiency and predictive performance. Furthermore, robustness checks highlight SPF's ability to adaptively balance domain knowledge and data-driven topic discovery, even in cases of imperfect seed word selection. These results establish SPF as a powerful and scalable alternative for integrating expert knowledge into topic modeling, enhancing both interpretability and efficiency in real-world applications.
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (5 more...)
- Consumer Products & Services (0.46)
- Retail > Online (0.34)
- Information Technology (0.34)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)
Rethinking Word Similarity: Semantic Similarity through Classification Confusion
Zhou, Kaitlyn, Gao, Haishan, Chen, Sarah, Edelstein, Dan, Jurafsky, Dan, Shani, Chen
Word similarity has many applications to social science and cultural analytics tasks like measuring meaning change over time and making sense of contested terms. Yet traditional similarity methods based on cosine similarity between word embeddings cannot capture the context-dependent, asymmetrical, polysemous nature of semantic similarity. We propose a new measure of similarity, Word Confusion, that reframes semantic similarity in terms of feature-based classification confusion. Word Confusion is inspired by Tversky's suggestion that similarity features be chosen dynamically. Here we train a classifier to map contextual embeddings to word identities and use the classifier confusion (the probability of choosing a confounding word c instead of the correct target word t) as a measure of the similarity of c and t. The set of potential confounding words acts as the chosen features. Our method is comparable to cosine similarity in matching human similarity judgments across several datasets (MEN, WirdSim353, and SimLex), and can measure similarity using predetermined features of interest. We demonstrate our model's ability to make use of dynamic features by applying it to test a hypothesis about changes in the 18th C. meaning of the French word "revolution" from popular to state action during the French Revolution. We hope this reimagining of semantic similarity will inspire the development of new tools that better capture the multi-faceted and dynamic nature of language, advancing the fields of computational social science and cultural analytics and beyond.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > California > Riverside County > Riverside (0.04)
- (12 more...)
LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework
Chang, Chia-Hsuan, Tsai, Jui-Tse, Tsai, Yi-Hang, Hwang, San-Yih
Topic modeling is widely used for uncovering thematic structures within text corpora, yet traditional models often struggle with specificity and coherence in domain-focused applications. Guided approaches, such as SeededLDA and CorEx, incorporate user-provided seed words to improve relevance but remain labor-intensive and static. Large language models (LLMs) offer potential for dynamic topic refinement and discovery, yet their application often incurs high API costs. To address these challenges, we propose the LLM-assisted Iterative Topic Augmentation framework (LITA), an LLM-assisted approach that integrates user-provided seeds with embedding-based clustering and iterative refinement. LITA identifies a small number of ambiguous documents and employs an LLM to reassign them to existing or new topics, minimizing API costs while enhancing topic quality. Experiments on two datasets across topic quality and clustering performance metrics demonstrate that LITA outperforms five baseline models, including LDA, SeededLDA, CorEx, BERTopic, and PromptTopic. Our work offers an efficient and adaptable framework for advancing topic modeling and text clustering.
- North America > United States > Connecticut > New Haven County > New Haven (0.04)
- Asia > Taiwan > Takao Province > Kaohsiung (0.04)
- Asia > Middle East > Jordan (0.04)
Building Knowledge-Guided Lexica to Model Cultural Variation
Havaldar, Shreya, Giorgi, Salvatore, Rai, Sunny, Talhelm, Thomas, Guntuku, Sharath Chandra, Ungar, Lyle
Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs' failure to measure cultural variation or generate culturally varied language.
- Asia > China (0.24)
- North America > United States > California > San Francisco County > San Francisco (0.24)
- North America > United States > California > Los Angeles County > Los Angeles (0.24)
- (13 more...)
- Health & Medicine > Therapeutic Area (0.47)
- Information Technology > Services (0.46)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Enhancing Social Media Post Popularity Prediction with Visual Content
Jeong, Dahyun, Son, Hyelim, Choi, Yunjin, Kim, Keunwoo
Our study presents a framework for predicting image-based social media content popularity that focuses on addressing complex image information and a hierarchical data structure. We utilize the Google Cloud Vision API to effectively extract key image and color information from users' postings, achieving 6.8% higher accuracy compared to using non-image covariates alone. For prediction, we explore a wide range of prediction models, including Linear Mixed Model, Support Vector Regression, Multi-layer Perceptron, Random Forest, and XGBoost, with linear regression as the benchmark. Our comparative study demonstrates that models that are capable of capturing the underlying nonlinear interactions between covariates outperform other methods.
- Asia > South Korea > Seoul > Seoul (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Information Technology > Services (1.00)
- Health & Medicine (1.00)
Adjusting Interpretable Dimensions in Embedding Space with Human Judgments
Erk, Katrin, Apidianaki, Marianna
Embedding spaces contain interpretable dimensions indicating gender, formality in style, or even object properties. This has been observed multiple times. Such interpretable dimensions are becoming valuable tools in different areas of study, from social science to neuroscience. The standard way to compute these dimensions uses contrasting seed words and computes difference vectors over them. This is simple but does not always work well. We combine seed-based vectors with guidance from human ratings of where words fall along a specific dimension, and evaluate on predicting both object properties like size and danger, and the stylistic properties of formality and complexity. We obtain interpretable dimensions with markedly better performance especially in cases where seed-based dimensions do not work well.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > New York (0.04)
- (28 more...)