
Collaborating Authors: Paritosh, Praveen


DMLR: Data-centric Machine Learning Research -- Past, Present and Future

arXiv.org Artificial Intelligence

Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.


Modeling subjectivity (by Mimicking Annotator Annotation) in toxic comment identification across diverse communities

arXiv.org Artificial Intelligence

The prevalence and impact of toxic discussions online have made content moderation crucial. Automated systems can play a vital role in identifying toxicity and reducing the reliance on human moderation. Nevertheless, identifying toxic comments for diverse communities continues to present challenges that are addressed in this paper. The two-part goal of this study is to (1) identify intuitive variances from annotator disagreement using quantitative analysis and (2) model the subjectivity of these viewpoints. To achieve our goal, we published a new dataset (https://github.com/XXX) with expert annotators' annotations and used two other public datasets to identify the subjectivity of toxicity. Then, leveraging a Large Language Model (LLM), we evaluate the model's ability to mimic diverse viewpoints on toxicity by varying the size of the training data and by testing both on the same set of annotators used during model training and on a separate set of annotators. We conclude that subjectivity is evident across all annotator groups, demonstrating the shortcomings of majority-rule voting. Moving forward, subjective annotations should serve as ground-truth labels for training models in domains like toxicity in diverse communities.
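
As a minimal illustration of the aggregation issue this work targets, the Python sketch below (with a hypothetical annotation schema; it is not the paper's released code or dataset) contrasts majority-rule labels with per-annotator labels that preserve disagreement.

import pandas as pd

# Hypothetical schema: one row per (comment_id, annotator_id) with a binary toxicity label.
annotations = pd.DataFrame({
    "comment_id":   [1, 1, 1, 2, 2, 2],
    "annotator_id": ["a", "b", "c", "a", "b", "c"],
    "toxic":        [1, 1, 0, 0, 1, 0],
})

# Majority-rule aggregation collapses disagreement into a single label per comment.
majority = (
    annotations.groupby("comment_id")["toxic"]
    .agg(lambda votes: int(votes.mean() >= 0.5))
    .rename("majority_label")
)
print(majority)

# Keeping per-annotator rows preserves the disagreement signal, so a model can be trained
# to predict (comment, annotator) -> label instead of a single aggregated "ground truth".
per_annotator = annotations.set_index(["comment_id", "annotator_id"])["toxic"]
print(per_annotator)

Under majority vote, comment 1 ends up toxic and comment 2 non-toxic even though one annotator dissents in each case; the per-annotator view keeps that information available to the model.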


DataPerf: Benchmarks for Data-Centric AI Development

arXiv.org Artificial Intelligence

Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
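
The data-centric loop such benchmarks encourage can be sketched in a few lines: the model and the evaluation set stay fixed, and "submissions" differ only in which training examples they select. The dataset, selection rules, and budget below are hypothetical stand-ins, not an official DataPerf baseline.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def evaluate_subset(idx):
    # Fixed model and fixed test set; only the training data changes between submissions.
    model = LogisticRegression(max_iter=2000)
    model.fit(X_pool[idx], y_pool[idx])
    return model.score(X_test, y_test)

budget = 200  # hypothetical labeling budget
rng = np.random.default_rng(0)
random_idx = rng.choice(len(X_pool), size=budget, replace=False)

# A slightly smarter (still toy) selection rule: spend the budget evenly across classes.
balanced_idx = np.concatenate([np.where(y_pool == c)[0][: budget // 10] for c in range(10)])

print("random selection   :", round(evaluate_subset(random_idx), 3))
print("balanced selection :", round(evaluate_subset(balanced_idx), 3))

A DataPerf-style challenge ranks such data-selection strategies by the held-out score, rewarding better data rather than bigger models.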


Data Excellence for AI: Why Should You Care

arXiv.org Artificial Intelligence

The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which their empirical progress is measured. Benchmark datasets define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the algorithmic aspect of the models rather than critiquing and improving the data with which our models operate. If "data is the new oil," we are still missing work on the refineries by which the data itself could be optimized for more effective use.


Cross-replication Reliability -- An Empirical Approach to Interpreting Inter-rater Reliability

arXiv.org Artificial Intelligence

We present a new approach to interpreting inter-rater reliability (IRR) that is empirical and contextualized. It is based upon benchmarking IRR against baseline measures in a replication, one of which is a novel cross-replication reliability (xRR) measure based on Cohen's kappa. We call this approach the xRR framework. We open-source a replication dataset of 4 million human judgments of facial expressions and analyze it with the proposed framework. We argue this framework can be used to measure the quality of crowdsourced datasets.
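
As a rough illustration of the comparison the framework formalizes (not the paper's exact xRR estimator), the sketch below computes Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), within each of two hypothetical replications and then averages pairwise kappas across replications.

from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels for 8 items, two raters per replication.
rep1_rater_a = [1, 0, 1, 1, 0, 0, 1, 0]
rep1_rater_b = [1, 0, 1, 0, 0, 0, 1, 1]
rep2_rater_a = [1, 0, 0, 1, 0, 1, 1, 0]
rep2_rater_b = [1, 1, 1, 1, 0, 0, 1, 0]

# Within-replication IRR: Cohen's kappa corrects observed agreement for chance agreement.
irr_rep1 = cohen_kappa_score(rep1_rater_a, rep1_rater_b)
irr_rep2 = cohen_kappa_score(rep2_rater_a, rep2_rater_b)

# Cross-replication agreement: pair raters across the two replications and average the kappas.
xrr = (
    cohen_kappa_score(rep1_rater_a, rep2_rater_a)
    + cohen_kappa_score(rep1_rater_a, rep2_rater_b)
    + cohen_kappa_score(rep1_rater_b, rep2_rater_a)
    + cohen_kappa_score(rep1_rater_b, rep2_rater_b)
) / 4

print(f"IRR, replication 1: {irr_rep1:.2f}")
print(f"IRR, replication 2: {irr_rep2:.2f}")
print(f"cross-replication agreement (simplified): {xrr:.2f}")

Benchmarking the within-replication kappas against the cross-replication value is the kind of contextualized reading of IRR the framework advocates.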


The AI Bookie

AI Magazine

The AI Bookie column documents highlights from AI Bets, an online forum for the creation of adjudicatable predictions and bets about the future of AI. While it is easy to make a prediction about the future, this forum was created to help researchers craft predictions whose accuracy can be clearly and unambiguously judged when they come due. The bets will be documented online, and regularly in this publication in The AI Bookie column. We encourage bets that are rigorously and scientifically argued. It is common these days to hear laments about the loss of rigor in AI (for example, see Lipton and Steinhardt 2018), and for researchers to point to dramatic overspecialization and the tendency of communities to endlessly pursue derivative results well past the point of no return.


Characterizing Online Discussion Using Coarse Discourse Sequences

AAAI Conferences

In this work, we present a novel method for classifying comments in online discussions into a set of coarse discourse acts towards the goal of better understanding discussions at scale. To facilitate this study, we devise a categorization of coarse discourse acts designed to encompass general online discussion and allow for easy annotation by crowd workers. We collect and release a corpus of over 9,000 threads comprising over 100,000 comments manually annotated via paid crowdsourcing with discourse acts and randomly sampled from the site Reddit. Using our corpus, we demonstrate how the analysis of discourse acts can characterize different types of discussions, including discourse sequences such as Q&A pairs and chains of disagreement, as well as different communities. Finally, we conduct experiments to predict discourse acts using our corpus, finding that structured prediction models such as conditional random fields can achieve an F1 score of 75%. We also demonstrate how the broadening of discourse acts from simply question and answer to a richer set of categories can improve the recall performance of Q&A extraction.
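
A minimal sketch of the kind of sequence labeling the prediction experiments describe, with toy threads, toy features, and the sklearn-crfsuite package (this is not the released corpus or the paper's models):

import sklearn_crfsuite  # pip install sklearn-crfsuite

def comment_features(comment, position):
    # Tiny illustrative feature set: lexical cues plus position in the thread.
    text = comment.lower()
    return {
        "has_question_mark": "?" in text,
        "starts_with_wh": text.split()[0] in {"what", "why", "how", "who", "when"},
        "has_disagreement_cue": any(cue in text for cue in ("no,", "disagree", "wrong")),
        "is_first_comment": position == 0,
    }

# Each thread is a sequence of comments; each comment gets one coarse discourse act.
threads = [
    ["What is the best way to learn Python?", "Start with the official tutorial.",
     "Thanks, that worked for me."],
    ["Why is my build failing?", "You are missing a dependency.",
     "No, I already installed it."],
]
acts = [
    ["question", "answer", "appreciation"],
    ["question", "answer", "disagreement"],
]

X = [[comment_features(c, i) for i, c in enumerate(thread)] for thread in threads]

# A linear-chain CRF models transitions between acts (e.g. question -> answer).
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, acts)
print(crf.predict(X))

On the real corpus, held-out threads rather than the training threads would be scored, which is where a reported F1 figure like the 75% above would come from.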


Toward a Comprehension Challenge, Using Crowdsourcing as a Tool

AI Magazine

Human readers comprehend vastly more, and in vastly different ways, than any existing comprehension test would suggest. An ideal comprehension test for a story should cover the full range of questions and answers that humans would expect other humans to reasonably learn or infer from a given story. As a step toward these goals we propose a novel test, the Crowdsourced Comprehension Challenge (C3), which is constructed by repeated runs of a three-person game, the Iterative Crowdsourced Comprehension Game (ICCG). ICCG uses structured crowdsourcing to comprehensively generate relevant questions and supported answers for arbitrary stories, whether fiction or nonfiction, presented across a variety of media such as videos, podcasts, and still images.


Rigorously Collecting Commonsense Judgments for Complex Question-Answer Content

AAAI Conferences

Community Question Answering (CQA) websites are a popular tool for internet users to fulfill diverse information needs. Posted questions can be multiple sentences long and span diverse domains. They go beyond factoid questions and can be conversational, opinion-seeking, or experiential questions that might have multiple, potentially conflicting, useful answers from different users. In this paper, we describe a large-scale formative study to collect commonsense properties of questions and answers from 18 diverse communities on stackexchange.com. We collected 50,000 human judgments on 500 question-answer pairs. Commonsense properties are features that humans can extract and characterize reliably using their commonsense knowledge and native language skills; no special domain expertise is assumed. We report results and suggestions for designing human computation tasks for collecting commonsense semantic judgments.


Workshops Held at the First AAAI Conference on Human Computation and Crowdsourcing: A Report

AI Magazine

The first AAAI Conference on Human Computation and Crowdsourcing (HCOMP-2013) was held November 6-9, 2013, in Palm Springs, California. Three workshops took place on Saturday, November 9: Crowdsourcing at Scale (full day), Human and Machine Learning in Games (full day), and Scaling Speech, Language Understanding and Dialogue through Crowdsourcing (half day).