AITopics | word list

Collaborating Authors

word list

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

fce2d8a485746f76aac7b5650db2679d-Paper-Conference.pdf

Neural Information Processing SystemsFeb-18-2026, 19:30:26 GMT

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > Canada > Alberta (0.14)
Europe > United Kingdom > England (0.04)
Asia > China (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (0.93)

Industry:

Law (0.93)
Banking & Finance > Economy (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Data Science (0.92)
(2 more...)

Add feedback

Probing Social Bias in Labor Market Text Generation by ChatGPT: A Masked Language Model Approach Lei Ding

Neural Information Processing SystemsOct-10-2025, 22:21:56 GMT

The complexity of automating bias evaluation in textual content poses significant challenges. Traditional approaches in social sciences, such as content analysis, often rely on manual word counts from static lists [Gaucher et al., 2011], which may miss the subtleties and unlisted language cues that advanced NLP technologies can detect.

application, chatgpt, job application, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada > Alberta (0.14)
Europe > United Kingdom > England (0.04)
Asia > China (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (0.93)

Industry:

Law (0.93)
Banking & Finance > Economy (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.85)

Add feedback

201d546992726352471cfea6b0df0a48-AuthorFeedback.pdf

Neural Information Processing SystemsOct-2-2025, 08:45:43 GMT

effect size, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.33)
Information Technology > Artificial Intelligence > Machine Learning (0.32)

Add feedback

Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities

Demers, Elizabeth, Wang, Victor Xiaoqi, Wu, Kean

arXiv.org Artificial IntelligenceJun-13-2025

Human capital (HC) is increasingly important to corporate value creation. Unlike other assets, however, HC is not currently subject to well-defined measurement or disclosure rules. We use a machine learning algorithm (word2vec) trained on a confirmed set of HC disclosures to develop a comprehensive list of HC-related keywords classified into five subcategories (DEI; health and safety; labor relations and culture; compensation and benefits; and demographics and other) that capture the multidimensional nature of HC management. We share our lexicon, corporate HC disclosures, and the Python code used to develop the lexicon, and we provide detailed examples of using our data and code, including for fine-tuning a BERT model. Researchers can use our HC lexicon (or modify the code to capture another construct of interest) with their samples of corporate communications to address pertinent HC questions. We close with a discussion of future research opportunities related to HC management and disclosure.

community relations, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.2308/ISYS-2023-023

2506.10155

Country:

Europe > United Kingdom (0.14)
North America > United States > New York > Monroe County > Rochester (0.05)
Europe > France > Normandy > Seine-Maritime > Rouen (0.04)
(4 more...)

Genre:

Research Report (1.00)
Overview (1.00)
Public Relations > Community Relations (0.46)

Industry:

Social Sector (1.00)
Law > Labor & Employment Law (1.00)
Law > Civil Rights & Constitutional Law (1.00)
(13 more...)

Technology:

Information Technology > Enterprise Applications > Human Resources (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models

Teleki, Maria, Dong, Xiangjue, Liu, Haoran, Caverlee, James

arXiv.org Artificial IntelligenceApr-16-2025

Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words -- discovered via LDA and BERTopic -- to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models -- and this embedding disparity is a representational harm and a masculine default.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2504.11431

Country:

Asia > Middle East > Jordan (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
(10 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media (1.00)
Government > Regional Government (0.46)
Leisure & Entertainment > Games > Computer Games (0.34)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data

Tjuka, Annika, Forkel, Robert, Rzymski, Christoph, List, Johann-Mattis

arXiv.org Artificial IntelligenceMar-14-2025

Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with an enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.

artificial intelligence, natural language, text processing, (16 more...)

arXiv.org Artificial Intelligence

2503.11377

Country:

Europe > Germany > Saxony > Leipzig (0.05)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(8 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.67)
Education (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.48)

Add feedback

The creative psychometric item generator: a framework for item generation and validation using large language models

Laverghetta, Antonio Jr., Luchini, Simone, Linell, Averie, Reiter-Palmon, Roni, Beaty, Roger

arXiv.org Artificial IntelligenceAug-30-2024

Increasingly, large language models (LLMs) are being used to automate workplace processes requiring a high degree of creativity. While much prior work has examined the creativity of LLMs, there has been little research on whether they can generate valid creativity assessments for humans despite the increasingly central role of creativity in modern economies. We develop a psychometrically inspired framework for creating test items (questions) for a classic free-response creativity test: the creative problem-solving (CPS) task. Our framework, the creative psychometric item generator (CPIG), uses a mixture of LLM-based item generators and evaluators to iteratively develop new prompts for writing CPS items, such that items from later iterations will elicit more creative responses from test takers. We find strong empirical evidence that CPIG generates valid and reliable items and that this effect is not attributable to known biases in the evaluation process. Our findings have implications for employing LLMs to automatically generate valid and reliable creativity tests for humans and AI.

creativity, item generation, originality, (12 more...)

arXiv.org Artificial Intelligence

2409.00202

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Pennsylvania > Centre County > University Park (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Government > Regional Government > North America Government > United States Government (0.93)
Government > Military (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

PostMark: A Robust Blackbox Watermark for Large Language Models

Chang, Yapei, Krishna, Kalpesh, Houmansadr, Amir, Wieting, John, Iyyer, Mohit

arXiv.org Artificial IntelligenceJun-20-2024

The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.

ark, language model, watermark, (16 more...)

arXiv.org Artificial Intelligence

2406.14517

Country:

North America > United States > Colorado (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Add feedback

Statistical Uncertainty in Word Embeddings: GloVe-V

Vallebueno, Andrea, Handan-Nader, Cassandra, Manning, Christopher D., Ho, Daniel E.

arXiv.org Artificial IntelligenceJun-17-2024

Static word embeddings are ubiquitous in computational social science applications and contribute to practical decision-making in a variety of fields including law and healthcare. However, assessing the statistical uncertainty in downstream conclusions drawn from word embedding statistics has remained challenging. When using only point estimates for embeddings, researchers have no streamlined way of assessing the degree to which their model selection criteria or scientific conclusions are subject to noise due to sparsity in the underlying data used to generate the embeddings. We introduce a method to obtain approximate, easy-to-use, and scalable reconstruction error variance estimates for GloVe (Pennington et al., 2014), one of the most widely used word embedding models, using an analytical approximation to a multivariate normal model. To demonstrate the value of embeddings with variance (GloVe-V), we illustrate how our approach enables principled hypothesis testing in core word embedding tasks, such as comparing the similarity between different word pairs in vector space, assessing the performance of different models, and analyzing the relative degree of ethnic or gender bias in a corpus using different word lists.

computational linguistic, proceedings, vector, (16 more...)

arXiv.org Artificial Intelligence

2406.12165

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(16 more...)

Genre: Research Report (0.83)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Zyda: A 1.3T Dataset for Open Language Modeling

Tokpanov, Yury, Millidge, Beren, Glorioso, Paolo, Pilault, Jonathan, Ibrahim, Adam, Whittington, James, Anthony, Quentin

arXiv.org Artificial IntelligenceJun-4-2024

The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda's effectiveness, outperforming even the best of its constituent datasets when used independently.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2406.01981

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)

Add feedback