main idea
Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis
Parfenova, Angelina, Marfurt, Andreas, Denzler, Alexander, Pfeffer, Juergen
This paper investigates the automation of qualitative data analysis, focusing on inductive coding with large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research examines the inductive process, where labels emerge from the data. The study evaluates the performance of six open-source LLMs against human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human- and LLM-generated labels by comparing them to the gold standard from the test set. While human annotations sometimes differ from the gold standard, they are often rated more favorably by other humans. In contrast, some LLMs align more closely with the true labels but receive lower ratings from experts.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- (3 more...)
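The abstract above compares human- and LLM-generated codes against gold-standard labels without naming a metric. As a minimal sketch of how such alignment could be scored, assuming a sentence-embedding model; the model name, example labels, and pairwise setup are illustrative, not taken from the paper:

```python
# Sketch: scoring how closely generated codes align with gold-standard labels.
# Cosine similarity between sentence embeddings is one plausible choice; the
# paper's actual metric is not specified in the abstract.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def alignment_scores(generated_labels, gold_labels):
    """Cosine similarity between each generated code and its gold label."""
    gen = model.encode(generated_labels, convert_to_tensor=True)
    gold = model.encode(gold_labels, convert_to_tensor=True)
    # Diagonal entries: score for the i-th (generated, gold) pair.
    return util.cos_sim(gen, gold).diagonal()

# Hypothetical example codes, for illustration only.
human = ["work-life balance concerns", "distrust of management"]
gold = ["work-life balance", "lack of trust in leadership"]
print(alignment_scores(human, gold))
```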
Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning
Noorbakhsh, Kimia, Chandler, Joseph, Karimi, Pantea, Alizadeh, Mohammad, Balakrishnan, Hari
Assessing and enhancing human learning through question-answering is vital, yet automating this process remains challenging. While large language models (LLMs) excel at summarization and query responses, their ability to generate meaningful questions for learners is underexplored. We propose Savaal, a scalable question-generation system with three objectives: (i) scalability, enabling question generation from hundreds of pages of text; (ii) depth of understanding, producing questions that go beyond factual recall to test conceptual reasoning; and (iii) domain independence, automatically generating questions across diverse knowledge areas. Instead of providing an LLM with large documents as context, Savaal improves results with a three-stage processing pipeline. Our evaluation with 76 human experts on 71 papers and PhD dissertations shows that Savaal generates questions that better test depth of understanding by 6.5X for dissertations and 1.5X for papers compared to a direct-prompting LLM baseline. Notably, as document length increases, Savaal's advantages in higher question quality and lower cost become more pronounced.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (14 more...)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
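The Savaal abstract names a three-stage pipeline but not its mechanics. A plausible sketch, assuming a generic `llm(prompt) -> str` callable; the chunk size, prompts, and stage boundaries here are illustrative assumptions rather than the paper's design:

```python
# Sketch of a three-stage, concept-driven question-generation pipeline in the
# spirit of Savaal: extract concepts per chunk, rank them document-wide, then
# generate conceptual questions. All prompts and parameters are assumptions.
def generate_questions(document: str, llm, chunk_chars: int = 8000, top_k: int = 10):
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]

    # Stage 1: extract candidate concepts per chunk (keeps context windows small).
    concepts = []
    for chunk in chunks:
        out = llm(f"List the key concepts discussed in this passage:\n{chunk}")
        concepts.extend(line.strip("- ") for line in out.splitlines() if line.strip())

    # Stage 2: rank and deduplicate concepts across the whole document.
    ranked = llm(
        "Merge duplicates and rank these concepts by importance, one per line:\n"
        + "\n".join(concepts)
    ).splitlines()[:top_k]

    # Stage 3: one conceptual (non-recall) question per top concept.
    return [
        llm(f"Write one question testing conceptual understanding of: {c}")
        for c in ranked
    ]
```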
Reviews: Adaptive GNN for Image Analysis and Editing
They introduce an adaptive GNN formulated as a label propagation system, which can be related to two computer vision operations: filtering and propagation. The adaptive GNN is built from a guided map, a graph Laplacian, and node weights. The guided map and node weights correspond to the filtering and propagation-diffusion tasks in computer vision, while the kernel of the graph Laplacian determines the diffusion pattern. They apply the model to quotient image analysis (QIA) and design various illumination editing tasks for faces and scenes.
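For context on the formulation the review refers to: label propagation with per-node weights and a graph Laplacian has a standard closed form, since minimizing sum_i w_i (f_i - y_i)^2 + lambda * f^T L f gives f = (W + lambda*L)^{-1} W y. A minimal numpy sketch with illustrative edge and node weights (the paper's exact construction is not given in the review):

```python
# Minimal label-propagation-as-diffusion sketch. The affinity matrix plays the
# role of a guided map (edge weights), node weights set where the input signal
# is trusted, and the Laplacian's kernel shapes the diffusion pattern.
import numpy as np

def propagate(y, A, node_w, lam=1.0):
    """Solve min_f sum_i w_i (f_i - y_i)^2 + lam * f^T L f."""
    L = np.diag(A.sum(axis=1)) - A   # graph Laplacian from the affinity matrix
    W = np.diag(node_w)              # per-node fidelity weights
    return np.linalg.solve(W + lam * L, W @ y)

# Toy chain graph: strong smoothing pulls node values toward their neighbors.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(propagate(np.array([1.0, 0.0, 0.0]), A, node_w=np.ones(3), lam=2.0))
```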
AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models
As Large Language Models (LLMs) are pretrained on massive-scale corpora, the issue of data contamination has become increasingly severe, leading to potential overestimation of model performance during evaluation. To address this, we propose AdEval (Alignment-based Dynamic Evaluation), a dynamic evaluation method aimed at mitigating the impact of data contamination on evaluation reliability. AdEval extracts key knowledge points and main ideas to align dynamically generated questions with the core concepts of the static data. It also leverages online search to provide detailed explanations of related knowledge points, thereby creating high-quality evaluation samples with robust knowledge support. Furthermore, AdEval controls the number and complexity of generated questions, ensuring they match the complexity of the static data while still covering varied complexity levels. Based on Bloom's taxonomy, AdEval conducts a multi-dimensional evaluation of LLMs across six cognitive levels: remembering, understanding, applying, analyzing, evaluating, and creating. Experimental results on multiple datasets demonstrate that AdEval effectively reduces the impact of data contamination on evaluation outcomes, enhancing both the fairness and reliability of the evaluation process.
- Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
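The AdEval abstract outlines its steps without prompts or interfaces. A hedged sketch of the generation loop, assuming a generic `llm(prompt) -> str` callable; the prompt wording and per-level counts are illustrative, not from the paper:

```python
# Sketch of AdEval-style dynamic question generation: extract knowledge points
# from a static sample, then generate aligned questions at each of Bloom's six
# cognitive levels. Online-search augmentation is omitted here.
BLOOM_LEVELS = ["remembering", "understanding", "applying",
                "analyzing", "evaluating", "creating"]

def dynamic_eval_set(static_sample: str, llm, per_level: int = 2):
    # Align generated questions with the static data's core concepts.
    points = llm(f"Extract the key knowledge points and main ideas:\n{static_sample}")
    questions = {}
    for level in BLOOM_LEVELS:
        questions[level] = [
            llm(f"Using these knowledge points:\n{points}\n"
                f"Write a question at Bloom's '{level}' level, "
                f"matched in complexity to the source.")
            for _ in range(per_level)
        ]
    return questions
```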
Rationale Behind Essay Scores: Enhancing S-LLM's Multi-Trait Essay Scoring with Rationale Generated by LLMs
Chu, SeongYeub, Kim, JongWoo, Wong, Bryan, Yi, MunYong
Existing automated essay scoring (AES) methods have relied solely on essay text, without using explanatory rationales for the scores, thereby forgoing an opportunity to capture the specific aspects evaluated by rubric indicators in a fine-grained manner. This paper introduces Rationale-based Multiple Trait Scoring (RMTS), a novel approach to multi-trait essay scoring that integrates prompt-engineering-based large language models (LLMs) with a fine-tuned essay scoring model built on a smaller large language model (S-LLM). RMTS uses an LLM-based trait-wise rationale generation system in which a separate LLM agent generates trait-specific rationales based on rubric guidelines, which the scoring model uses to accurately predict multi-trait scores. Extensive experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize, show that RMTS significantly outperforms state-of-the-art models and vanilla S-LLMs in trait-specific scoring. By supporting quantitative assessment with fine-grained qualitative rationales, RMTS enhances trait-wise reliability and provides partial explanations of the essays.
- Education > Educational Setting (1.00)
- Education > Assessment & Standards > Student Performance (1.00)
- Education > Educational Technology > Educational Software > Computer Based Training (0.34)
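The RMTS abstract describes the division of labor between the rationale-generating LLM and the S-LLM scorer, but not their interfaces. A sketch under assumed interfaces; the `llm` callable, the `scorer.predict` method, and the prompt are all hypothetical:

```python
# Sketch of RMTS-style trait scoring: an LLM agent writes a rubric-grounded
# rationale per trait, and a smaller fine-tuned scorer consumes the rationale
# alongside the essay to predict that trait's score.
def score_traits(essay: str, rubric: dict, llm, scorer):
    scores = {}
    for trait, guidelines in rubric.items():
        # Trait-wise rationale generation, grounded in the rubric guidelines.
        rationale = llm(
            f"Rubric for '{trait}': {guidelines}\n"
            f"Essay: {essay}\n"
            f"Explain how well the essay meets this rubric."
        )
        # The fine-tuned S-LLM predicts the trait score from essay + rationale.
        scores[trait] = scorer.predict(essay=essay, trait=trait, rationale=rationale)
    return scores
```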
Reviews: Dynamic-Depth Context Tree Weighting
The paper develops a variation on Context Tree Weighting (CTW) that keeps memory costs low by adapting the depth of each branch to the extent that it aids prediction accuracy. The new algorithm, called Utile Context Tree Weighting (UCTW), is shown empirically in some illustrative examples to use less memory than fixed-depth CTW (since it can keep some branches short) and to be more effective under a memory bound (in which it must prune a node every time it expands one). The experiments are, for the most part, well designed to answer the questions being asked. One experiment that felt less well posed was the T-Maze. The text says "We consider a maze of length 4. Thus we set K = 3." What does that "thus" mean?
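For readers outside the CTW literature, the ingredients the review takes for granted are the Krichevsky-Trofimov (KT) estimator at each node and the mixing recursion P_w = (P_e + P_w(child 0) * P_w(child 1)) / 2. A compact sketch of plain fixed-depth CTW on binary sequences; UCTW's expand/prune machinery is not shown:

```python
# Fixed-depth CTW sketch: each context node keeps KT counts and mixes its own
# estimate with the product of its children's weighted probabilities.
class Node:
    def __init__(self, depth):
        self.a = self.b = 0      # counts of 0s and 1s seen in this context
        self.pe = 1.0            # KT estimate of the bits seen here
        self.pw = 1.0            # weighted (mixed) probability
        self.children = {}       # child per preceding context bit
        self.depth = depth

    def update(self, bit, context, max_depth):
        # Sequential KT update: P(1 | a zeros, b ones) = (b + 1/2) / (a + b + 1).
        p = ((self.b if bit else self.a) + 0.5) / (self.a + self.b + 1)
        self.pe *= p
        self.a += (bit == 0)
        self.b += (bit == 1)
        if self.depth == max_depth or not context:
            self.pw = self.pe    # leaf: no mixing
        else:
            child = self.children.setdefault(context[-1], Node(self.depth + 1))
            child.update(bit, context[:-1], max_depth)
            prod = 1.0
            for ch in self.children.values():  # absent children contribute 1
                prod *= ch.pw
            self.pw = 0.5 * self.pe + 0.5 * prod

root, history = Node(0), []
for b in [1, 0, 1, 1, 0]:
    root.update(b, tuple(history[-3:]), max_depth=3)
    history.append(b)
print(root.pw)  # weighted probability of the whole sequence
```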
How Well Can You Articulate that Idea? Insights from Automated Formative Assessment
Karizaki, Mahsa Sheikhi, Gnesdilow, Dana, Puntambekar, Sadhana, Passonneau, Rebecca J.
Automated methods are becoming increasingly integrated into studies of formative feedback on students' science explanation writing. Most of this work, however, addresses students' responses to short-answer questions. We investigate automated feedback on students' science explanation essays, where students must articulate multiple ideas. Feedback is based on a rubric that identifies the main ideas students are prompted to include in explanatory essays about the physics of energy and mass, given their experiments with a simulated roller coaster. We have found that students generally improve on revised versions of their essays. Here, however, we focus on two factors that affect the accuracy of the automated feedback. First, the main ideas in the rubric differ in how much freedom they afford students in explaining them: explanation of a natural law is relatively constrained, whereas students have more freedom in how they explain complex relations they observe in their roller coasters, such as the transfer of different forms of energy. Second, by tracing the automated decision process, we can diagnose when a student's statement lacks sufficient clarity for the automated tool to associate it more strongly with one of the main ideas above all others. This in turn provides an opportunity for teachers and peers to help students reflect on how to state their ideas more clearly.
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > United States > Pennsylvania > Centre County > State College (0.04)
- (2 more...)
- Education > Educational Setting > K-12 Education (0.95)
- Education > Assessment & Standards > Assessment Methods (0.71)
- Education > Curriculum > Subject-Specific Education (0.71)
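The decision trace described in the abstract hinges on whether one rubric idea dominates all others for a given statement. A minimal sketch of that ambiguity test, assuming an embedding model and margin chosen purely for illustration; the paper's actual tool is not specified at this level of detail:

```python
# Sketch: associate a student statement with rubric ideas, and flag it as
# unclear when no single idea wins by a sufficient margin.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def associate(statement, rubric_ideas, margin=0.05):
    # Requires at least two rubric ideas to compare.
    sims = util.cos_sim(
        model.encode([statement], convert_to_tensor=True),
        model.encode(rubric_ideas, convert_to_tensor=True),
    )[0]
    ranked = sorted(zip(rubric_ideas, sims.tolist()), key=lambda t: -t[1])
    (top_idea, top), (_, runner_up) = ranked[0], ranked[1]
    if top - runner_up < margin:
        return None, ranked  # ambiguous: statement lacks clarity for the tool
    return top_idea, ranked
```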
Interpreting Themes from Educational Stories
Zhang, Yigeng, González, Fabio A., Solorio, Thamar
Reading comprehension continues to be a crucial research focus in the NLP community. Recent advances in Machine Reading Comprehension (MRC) have mostly centered on literal comprehension, referring to the surface-level understanding of content. In this work, we focus on the next level - interpretive comprehension, with a particular emphasis on inferring the themes of a narrative text. We introduce the first dataset specifically designed for interpretive comprehension of educational narratives, providing corresponding well-edited theme texts. The dataset spans a variety of genres and cultural origins and includes human-annotated theme keywords with varying levels of granularity. We further formulate NLP tasks under different abstractions of interpretive comprehension toward the main idea of a story. After conducting extensive experiments with state-of-the-art methods, we found the task to be both challenging and significant for NLP research.
- Asia > India (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > New York (0.04)
- (17 more...)
GPT-4 Understands Discourse at Least as Well as Humans Do
Shultz, Thomas, Wise, Jamie, Nobandegani, Ardavan Salehi
We test whether a leading AI system, GPT-4, understands discourse as well as humans do, using a standardized test of discourse comprehension. Participants are presented with brief stories and then answer eight yes/no questions probing their comprehension of each story. The questions are formatted to assess the separate impacts of directness (stated vs. implied) and salience (main idea vs. details). GPT-4 performs slightly, but not statistically significantly, better than humans, given the very high level of human performance. Both GPT-4 and humans exhibit a strong ability to make inferences about information that is not explicitly stated in a story, a critical test of understanding.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > New York (0.05)
- North America > Canada > Quebec > Montreal (0.05)
- North America > United States > Arizona > Pima County > Tucson (0.04)
- Education (0.48)
- Health & Medicine > Therapeutic Area > Neurology (0.30)
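The test design above is a 2x2 crossing of directness and salience over eight yes/no items per story. A small sketch of per-condition scoring, assuming an illustrative record layout; the abstract gives the design but not a data format:

```python
# Sketch: accuracy broken down by the directness x salience conditions of a
# discourse comprehension test. The dict keys are assumed field names.
from collections import defaultdict

def accuracy_by_condition(items):
    """items: dicts with keys 'directness', 'salience', 'answer', 'response'."""
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        cond = (it["directness"], it["salience"])
        total[cond] += 1
        correct[cond] += it["response"] == it["answer"]
    return {cond: correct[cond] / total[cond] for cond in total}

items = [
    {"directness": "stated", "salience": "main idea", "answer": True, "response": True},
    {"directness": "implied", "salience": "detail", "answer": False, "response": True},
]
print(accuracy_by_condition(items))
```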
49b8b4f95f02e055801da3b4f58e28b7-Reviews.html
The novelty itself does not feel groundbreaking because of this. The paper is also lacking in presentation: I can't see readers outside a small community following this paper without significant difficulty. I think the main ideas could be summarised nicely in one or two paragraphs, but it is currently a pain to extract them; the notation is guesswork and will frustrate a reader who is not from the field or who wants to go into details. There is no theory to support the density estimation point they make, and the covariance approximation bounds are also not very significant, since they rest on strong assumptions and do not seem very tight. Also, the connection to paper [1] needs to be pointed out clearly, with a proper discussion. On the plus side is Table 1 with the experimental results, which seem promising.