AITopics | krippendorff

Collaborating Authors

krippendorff

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage

Platt, Nolan, Luchs, Ethan, Nizamani, Sehrish

arXiv.org Artificial Intelligence, Dec-5-2025

Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, we generated over 850 heuristic evaluations in three independent evaluations per site using a pipeline of OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen's Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff's Alpha was near zero. These results suggest that while GPT-4o can produce internally consistent evaluations, especially for identifying the presence of usability issues, its ability to judge severity varies and requires human oversight in practice. Our findings highlight the feasibility and limitations of using LLMs for early-stage, automated usability testing, and offer a foundation for improving consistency in automated User Experience (UX) evaluation. To the best of our knowledge, our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic evaluation and highlights methods for improving model consistency.
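The agreement statistics named in this abstract (average pairwise Cohen's Kappa and exact agreement across three runs) can be illustrated with a small sketch. The data below are hypothetical, not taken from the paper, and the helper names (cohen_kappa, exact_agreement, mean_pairwise) are illustrative assumptions; the authors' own pipeline is not shown in this listing.

```python
from collections import Counter
from itertools import combinations


def exact_agreement(a, b):
    """Fraction of items on which two runs give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = exact_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two runs' marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both runs used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def mean_pairwise(metric, runs):
    """Average a two-run metric over every unordered pair of runs."""
    pairs = list(combinations(runs, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data: three runs over eight heuristic checks,
# 1 = usability issue flagged, 0 = no issue flagged.
runs = [
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 1, 0],
]

avg_kappa = mean_pairwise(cohen_kappa, runs)
avg_agreement = mean_pairwise(exact_agreement, runs)
print(f"mean pairwise kappa: {avg_kappa:.2f}")
print(f"mean exact agreement: {avg_agreement:.2f}")
```

Kappa discounts the agreement expected by chance, which is why it can sit well below the raw agreement rate when one label dominates.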

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/VL-HCC65237.2025.00024

arXiv: 2512.04262

Country: North America > United States > Virginia (0.16)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis

Parfenova, Angelina, Marfurt, Andreas, Denzler, Alexander, Pfeffer, Juergen

arXiv.org Artificial Intelligence, Dec-2-2025

This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research investigates the inductive process where labels emerge from the data. The study evaluates the performance of six open-source LLMs compared to human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human and LLM-generated labels by comparing them to the gold standard from the test set. While human annotations may sometimes differ from the gold standard, they are often rated more favorably by other humans. In contrast, some LLMs demonstrate closer alignment with the true labels but receive lower evaluations from experts.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2025.findings-naacl.361

arXiv: 2512.00046

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)


Efficiently Transforming Neural Networks into Decision Trees: A Path to Ground Truth Explanations with RENTT

Monke, Helena, Fresz, Benjamin, Bernreuther, Marco, Chen, Yilin, Huber, Marco F.

arXiv.org Artificial Intelligence, Nov-13-2025

Although neural networks are a powerful tool, their widespread use is hindered by the opacity of their decisions and their black-box nature, which result in a lack of trustworthiness. To alleviate this problem, methods in the field of explainable Artificial Intelligence try to unveil how such automated decisions are made. But explainable AI methods are often plagued by missing faithfulness/correctness, meaning that they sometimes provide explanations that do not align with the neural network's decision and logic. Recently, transformations to decision trees have been proposed to overcome such problems. Unfortunately, they typically lack exactness, scalability, or interpretability as the size of the neural network grows. Thus, we generalize these previous results, especially by considering convolutional neural networks, recurrent neural networks, non-ReLU activation functions, and bias terms. Our findings are accompanied by rigorous proofs and we present a novel algorithm RENTT (Runtime Efficient Network to Tree Transformation) designed to compute an exact equivalent decision tree representation of neural networks in a manner that is both runtime and memory efficient. The resulting decision trees are multivariate and thus, possibly too complex to understand. To alleviate this problem, we also provide a method to calculate the ground truth feature importance for neural networks via the equivalent decision trees - for entire models (global), specific input regions (regional), or single decisions (local). All theoretical results are supported by detailed numerical experiments that emphasize two key aspects: the computational efficiency and scalability of our algorithm, and that only RENTT succeeds in uncovering ground truth explanations compared to conventional approximation methods like LIME and SHAP. All code is available at https://github.com/HelenaM23/RENTT .
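The general idea behind an exact network-to-tree transformation can be sketched for a tiny ReLU network: each hidden unit's pre-activation sign acts as a multivariate split, and within each resulting region the network is an affine function. This is a minimal illustration of that equivalence under assumed, hypothetical weights (W1, b1, w2, b2); it is not the RENTT algorithm, whose details are in the paper and the linked repository.

```python
# Hypothetical 2-input, 2-hidden-unit ReLU network with a scalar output.
W1 = [[1.0, -1.0], [0.5, 2.0]]  # hidden-layer weights, one row per unit
b1 = [0.0, -1.0]                # hidden-layer biases
w2 = [2.0, -3.0]                # output weights
b2 = 0.5                        # output bias


def pre_activation(row, bias, x):
    """Weighted sum of the inputs plus bias for one hidden unit."""
    return sum(w * xi for w, xi in zip(row, x)) + bias


def network(x):
    """Ordinary forward pass through the ReLU network."""
    hidden = [max(0.0, pre_activation(r, b, x)) for r, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def leaf_model(pattern):
    """Affine model (coefficients, intercept) for one activation pattern."""
    coeffs = [0.0, 0.0]
    intercept = b2
    for active, row, bias, w_out in zip(pattern, W1, b1, w2):
        if active:  # inactive units contribute nothing to the output
            coeffs = [c + w_out * w for c, w in zip(coeffs, row)]
            intercept += w_out * bias
    return coeffs, intercept


def tree(x):
    """Route x by one multivariate split per hidden unit, then apply the leaf."""
    pattern = tuple(pre_activation(r, b, x) > 0 for r, b in zip(W1, b1))
    coeffs, intercept = leaf_model(pattern)
    return sum(c * xi for c, xi in zip(coeffs, x)) + intercept


# The two representations agree on a grid of test points.
grid = [(i / 2.0, j / 2.0) for i in range(-6, 7) for j in range(-6, 7)]
max_gap = max(abs(network(x) - tree(x)) for x in grid)
print(f"largest difference on the grid: {max_gap:.2e}")
```

The number of possible activation patterns doubles with each hidden unit, which gives a sense of why scalability is a concern for this kind of transformation.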

artificial intelligence, fi method, machine learning, (14 more...)

arXiv.org Artificial Intelligence

arXiv: 2511.09299

Country:

Europe (1.00)
North America > United States (0.68)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)


MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Whitehouse, Chenxi, Ruder, Sebastian, Lin, Tony, Kurylo, Oksana, Takagi, Haruka, Lam, Janice, Busetto, Nicolò, Diaz, Denise, Guzmán, Francisco

arXiv.org Artificial Intelligence, Nov-12-2025

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.

Dataset: https://huggingface.co/datasets/facebook/menlo

In order for LLMs to be most useful across the globe, they need to be able to provide high-quality responses in many languages. Responses should be relevant (Zhuang et al., 2024), factually accurate (Jacovi et al., 2025), and natural (Marchisio et al., 2024; Guo et al., 2025), among other considerations. Ultimately, for interaction in any language to be seamless, responses need to be indistinguishable from those of a native speaker (Novikova et al., 2016; Liu et al., 2021). Language proficiency in humans has traditionally been evaluated via standardized tests (Jamieson et al., 2000). While such tests have been applied to evaluating LLMs (Anil et al., 2023; Mayor-Rocher et al., 2024; Lothritz & Cabot, 2025), they are difficult to scale and do not readily correspond to real-world conversations.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

arXiv: 2509.26601

Country:

Europe (1.00)
Asia (1.00)
North America > United States (0.93)

Genre: Research Report > New Finding (0.86)

Industry: Education > Assessment & Standards > Student Performance (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)


XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification

Yadav, Sachin, Schlechtweg, Dominik

arXiv.org Artificial Intelligence, Nov-7-2025

We propose XL-DURel, a finetuned, multilingual Sentence Transformer model optimized for ordinal Word-in-Context classification. We test several loss functions for regression and ranking tasks, and outperform previous models on ordinal and binary data with a ranking objective based on angular distance in complex space. We further show that binary WiC can be treated as a special case of ordinal WiC and that optimizing models for the general ordinal task improves performance on the more specific binary task. This paves the way for a unified treatment of WiC modeling across different task formulations.

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

arXiv: 2507.14578

Country:

Europe (1.00)
Asia > Middle East > UAE (0.29)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.70)


Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

Haldar, Rajarshi, Hockenmaier, Julia

arXiv.org Artificial Intelligence, Nov-3-2025

As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and examine whether judicious use of LLM judges, following proper guidelines, can still be useful.
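One common way to quantify reliability across repeated runs is Krippendorff's alpha, the statistic this listing is indexed under. The sketch below is a plain-Python version under stated assumptions: the abstract does not say which statistic the authors use, the scores are hypothetical, and the names krippendorff_alpha, nominal, and interval are illustrative. Each unit is one rated item, and its values are the scores the same judge gave on separate runs.

```python
from itertools import permutations


def nominal(a, b):
    """Distance for unordered categories: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def interval(a, b):
    """Distance for numeric scores: squared difference."""
    return float((a - b) ** 2)


def krippendorff_alpha(units, distance):
    """Krippendorff's alpha = 1 - D_o / D_e for a list of per-item ratings.

    Missing ratings can be passed as None; items with fewer than two
    ratings are not pairable and are dropped.
    """
    units = [[v for v in u if v is not None] for u in units]
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: ordered pairs of ratings within each item.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: ordered pairs across all pooled ratings.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # every rating is identical, so there is no variation
    return 1.0 - d_o / d_e


# Hypothetical data: four outputs, each scored 1-5 by the same judge on three runs.
scores = [[4, 4, 3], [2, 3, 2], [5, 5, 5], [1, 2, 1]]
alpha = krippendorff_alpha(scores, interval)
print(f"alpha (interval): {alpha:.3f}")
```

An alpha of 1 means the runs agree perfectly, while values near 0 mean the runs agree no more than pooled ratings assigned at random would.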

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

arXiv: 2510.27106

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report > New Finding (0.88)

Industry:

Health & Medicine (0.67)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)


Generative Large Language Models (gLLMs) in Content Analysis: A Practical Guide for Communication Research

Kravets-Meinke, Daria, Schmid-Petri, Hannah, Niemann, Sonja, Schmid, Ute

arXiv.org Artificial Intelligence, Oct-29-2025

Generative Large Language Models (gLLMs), such as ChatGPT, are increasingly being used in communication research for content analysis. Studies show that gLLMs can outperform both crowd workers and trained coders, such as research assistants, on various coding tasks relevant to communication science, often at a fraction of the time and cost. Additionally, gLLMs can decode implicit meanings and contextual information, can be instructed using natural language and deployed with only basic programming skills, and require little to no annotated data beyond a validation dataset, constituting a paradigm shift in automated content analysis. Despite their potential, the integration of gLLMs into the methodological toolkit of communication research remains underdeveloped. In gLLM-assisted quantitative content analysis, researchers must address at least seven critical challenges that impact result quality: (1) codebook development, (2) prompt engineering, (3) model selection, (4) parameter tuning, (5) iterative refinement, (6) validation of the model's reliability, and optionally, (7) performance enhancement. This paper synthesizes emerging research on gLLM-assisted quantitative content analysis and proposes a comprehensive best-practice guide to navigate these challenges. Our goal is to make gLLM-based content analysis more accessible to a broader range of communication researchers and ensure adherence to established disciplinary quality standards of validity, reliability, reproducibility, and research ethics.

data mining, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

arXiv: 2510.24337

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Populism Meets AI: Advancing Populism Research with LLMs

Jung, Yujin J., Tamaki, Eduardo Ryô, Chatterley, Julia, Mitchell, Grant, Dzebo, Semir, Sandoval, Cristóbal, Littvay, Levente, Hawkins, Kirk A.

arXiv.org Artificial Intelligence, Oct-28-2025

Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field's foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time-consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric- and anchor-guided chain-of-thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders' speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model's reasoning. We then test multiple proprietary and open-weight models by replicating scores in the GPD. Our findings reveal that this domain-specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context-sensitive aspects of populism.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

arXiv: 2510.07458

Country:

North America > United States > California (0.46)
Asia > Middle East > Republic of Türkiye (0.28)
Europe > United Kingdom > England (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
