Halfaker, Aaron
Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia
Kuo, Tzu-Sheng, Halfaker, Aaron, Cheng, Zirui, Kim, Jiwoo, Wu, Meng-Hsin, Wu, Tongshuang, Holstein, Kenneth, Zhu, Haiyi
AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.
On Improving Summarization Factual Consistency from Natural Language Feedback
Liu, Yixin, Deb, Budhaditya, Teruel, Milagro, Halfaker, Aaron, Radev, Dragomir, Awadallah, Ahmed H.
Despite the recent progress in language generation models, their outputs may not always meet user expectations. In this work, we study whether informational feedback in natural language can be leveraged to improve generation quality and user preference alignment. To this end, we consider factual consistency in summarization, i.e., the requirement that a summary contain only information supported by the input documents, as the user-expected preference. We collect a high-quality dataset, DeFacto, containing human demonstrations and informational natural language feedback consisting of corrective instructions, edited summaries, and explanations with respect to the factual consistency of the summary. Using our dataset, we study three natural language generation tasks: (1) editing a summary by following the human feedback, (2) generating human feedback for editing the original summary, and (3) revising the initial summary to correct factual errors by generating both the human feedback and edited summary. We show that DeFacto can provide factually consistent human-edited summaries and further insights into summarization factual consistency thanks to its informational natural language feedback. We further demonstrate that fine-tuned language models can leverage our dataset to improve summary factual consistency, while large language models lack zero-shot learning ability on our proposed tasks, which require controllable text generation.
Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system
Asthana, Sumit, Hilleli, Sagih, He, Pengcheng, Halfaker, Aaron
Meetings play a critical infrastructural role in the coordination of work. In recent years, due to the shift to hybrid and remote work, more meetings are moving to online computer-mediated spaces. This has led to new problems (e.g., more time spent in less engaging meetings) and new opportunities (e.g., automated transcription/captioning and recap support). Recent advances in large language models (LLMs) for dialogue summarization have the potential to improve the experience of meetings by reducing individuals' meeting load and increasing the clarity and alignment of meeting outputs. Despite this potential, these models face technological limitations due to long transcripts and an inability to capture diverse recap needs based on users' contexts. To address these gaps, we design, implement, and evaluate a meeting recap system in context. We first conceptualize two salient recap representations -- important highlights, and a structured, hierarchical minutes view. We develop a system that operationalizes these representations with dialogue summarization as its building block. Finally, we evaluate the effectiveness of the system with seven users in the context of their work meetings. Our findings show promise in using LLM-based dialogue summarization for meeting recap and the need for both representations in different contexts. However, we find that LLM-based recap still lacks an understanding of what is personally relevant to participants, can miss important details, and can produce mis-attributions that are detrimental to group dynamics. We identify collaboration opportunities, such as a shared recap document, that a high-quality recap enables. We report on implications for designing AI systems that partner with users to learn and improve from natural interactions, overcoming the limitations related to personal relevance and summarization quality.
Who Did What: Editor Role Identification in Wikipedia
Yang, Diyi (Carnegie Mellon University) | Halfaker, Aaron (Wikimedia Foundation) | Kraut, Robert (Carnegie Mellon University) | Hovy, Eduard (Carnegie Mellon University)
Understanding the social roles played by contributors to online communities can facilitate the process of task routing. In this work, we develop new techniques to find roles in Wikipedia based on editors' low-level edit types and investigate how work contributed by people in different roles affects article quality. To do this, we first built machine-learning models to automatically identify the edit categories associated with edits. We then applied a graphical model analogous to Latent Dirichlet Allocation to uncover the latent roles in editors' edit histories. Applying this technique revealed eight different roles that editors play. Finally, we validated how our identified roles collaborate to improve the quality of articles. The results demonstrate that editors occupying different roles contribute differently in terms of edit categories, and that articles in different quality stages need different types of editors. Implications for editor role identification and the validation of role contribution are discussed.
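The role-discovery approach described in this abstract, treating each editor's history as a "document" of edit categories and fitting an LDA-style mixed-membership model, can be sketched as follows. This is a minimal illustration using scikit-learn's `LatentDirichletAllocation`, not the paper's implementation; the edit-category matrix here is randomly generated, and the category count (12) is a placeholder assumption, while the eight components mirror the eight roles the paper reports.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical editor-by-edit-category count matrix: each row is one
# editor's history, each column a low-level edit category (e.g. copyedit,
# revert, reference work). Values here are simulated, not real data.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=3, size=(100, 12))

# Fit an LDA-style mixed-membership model: the "topics" play the part of
# latent editor roles; the paper identifies eight such roles.
lda = LatentDirichletAllocation(n_components=8, random_state=0)
role_mix = lda.fit_transform(counts)

# Each row is a soft assignment of that editor over the eight roles.
print(role_mix.shape)     # (100, 8)
print(role_mix[0].sum())  # each row sums to ~1.0
```

Inspecting `lda.components_` (role-by-category weights) would then show which edit categories characterize each latent role, analogous to reading topic-word distributions in topic modeling.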
Defense Mechanism or Socialization Tactic? Improving Wikipedia’s Notifications to Rejected Contributors
Geiger, R. Stuart (University of California, Berkeley) | Halfaker, Aaron (University of Minnesota) | Pinchuk, Maryana (Wikimedia Foundation) | Walling, Steven (Wikimedia Foundation)
Unlike traditional firms, open collaborative systems rely on volunteers to operate, and many communities struggle to maintain enough contributors to ensure the quality and quantity of content. However, Wikipedia has historically faced the exact opposite problem: too much participation, particularly from users who, knowingly or not, do not share the same norms as veteran Wikipedians. During its period of exponential growth, the Wikipedian community developed specialized socio-technical defense mechanisms to protect itself from the negatives of massive participation: spam, vandalism, falsehoods, and other damage. Yet recently, Wikipedia has faced a number of high-profile issues with recruiting and retaining new contributors. In this paper, we first illustrate and describe the various defense mechanisms at work in Wikipedia, which we hypothesize are inhibiting newcomer retention. Next, we present results from an experiment aimed at increasing both the quantity and quality of editors by altering various elements of these defense mechanisms, specifically pre-scripted warnings and notifications that are sent to new editors upon reverting or rejecting contributions. Using logistic regressions to model new user activity, we show which tactics work best for different populations of users based on their motivations when joining Wikipedia. In particular, we found that personalized messages in which Wikipedians identified themselves in active voice and took direct responsibility for rejecting an editor’s contributions were much more successful across a variety of outcome metrics than the current messages, which typically use an institutional and passive voice.
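The experimental analysis described above, logistic regression modeling whether a rejected newcomer keeps editing as a function of the warning they received, can be sketched roughly as below. The feature set, effect sizes, and data are all hypothetical stand-ins (the paper's actual models and covariates are not reproduced here); the simulation simply encodes the reported direction of the finding, that personalized, active-voice messages improve retention odds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated population of reverted newcomers. Features are illustrative:
# whether the rejection message was personalized (active voice, reverting
# editor identified) and a proxy for the newcomer's prior activity.
rng = np.random.default_rng(1)
n = 500
personalized = rng.integers(0, 2, n)
prior_edits = rng.poisson(2, n)

# Simulated outcome: personalized messages raise the log-odds that the
# newcomer continues editing (coefficients chosen for illustration only).
logit = -1.0 + 1.2 * personalized + 0.3 * prior_edits
stayed = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([personalized, prior_edits])
model = LogisticRegression().fit(X, stayed)

# The fitted coefficient on the personalization indicator recovers the
# positive effect built into the simulation.
print(model.coef_)
```

In the actual study, coefficients like these would be estimated per user population (segmented by joining motivation) to show which notification tactics work best for whom.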