Wang, Jingyan
What Can Natural Language Processing Do for Peer Review?
Kuznetsov, Ilia, Afzal, Osama Mohammed, Dercksen, Koen, Dycke, Nils, Goldberg, Alexander, Hope, Tom, Hovy, Dirk, Kummerfeld, Jonathan K., Lauscher, Anne, Leyton-Brown, Kevin, Lu, Sheng, Mausam, Mieskes, Margot, Névéol, Aurélie, Pruthi, Danish, Qu, Lizhen, Schwartz, Roy, Smith, Noah A., Solorio, Thamar, Wang, Jingyan, Zhu, Xiaodan, Rogers, Anna, Shah, Nihar B., Gurevych, Iryna
The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call to action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.
Modeling and Correcting Bias in Sequential Evaluation
Wang, Jingyan, Pananjady, Ashwin
We consider the problem of sequential evaluation, in which an evaluator observes candidates in a sequence and assigns scores to these candidates in an online, irrevocable fashion. Motivated by the psychology literature that has studied sequential bias in such settings -- namely, dependencies between the evaluation outcome and the order in which the candidates appear -- we propose a natural model for the evaluator's rating process that captures the lack of calibration inherent to such a task. We conduct crowdsourcing experiments to demonstrate various facets of our model. We then proceed to study how to correct sequential bias under our model by posing this as a statistical inference problem. We propose a near-linear-time, online algorithm for this task and prove guarantees in terms of two canonical ranking metrics. We also prove that our algorithm is information-theoretically optimal by establishing matching lower bounds in both metrics. Finally, we perform a host of numerical experiments to show that our algorithm often outperforms the de facto method of using the rankings induced by the reported scores, both in simulation and on the crowdsourcing data that we collected.
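As a minimal illustration of this setup, the Python sketch below simulates a sequentially biased evaluator and measures how well the de facto baseline, ranking candidates by their raw reported scores, recovers the true ordering via Kendall's tau. The anchoring-style miscalibration used here is an assumed toy model chosen only for illustration; it is not the rating model proposed in the paper, and no correction algorithm is reproduced.

# Toy simulation of sequential evaluation: scores depend on the order in which
# candidates appear. The anchoring bias below is an illustrative assumption.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n = 50
true_quality = rng.uniform(0, 1, size=n)      # latent quality of each candidate

scores = np.empty(n)
anchor = 0.5                                  # evaluator's initial reference point
for t in range(n):
    # The reported score is pulled toward a running anchor, so the same candidate
    # would receive different scores depending on its position in the sequence.
    scores[t] = 0.6 * true_quality[t] + 0.4 * anchor + rng.normal(0, 0.05)
    anchor = 0.9 * anchor + 0.1 * true_quality[t]

# De facto baseline: rank candidates by their raw reported scores.
tau_raw, _ = kendalltau(true_quality, scores)
print(f"Kendall tau between true qualities and raw reported scores: {tau_raw:.3f}")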
Perceptual adjustment queries and an inverted measurement paradigm for low-rank metric learning
Xu, Austin, McRae, Andrew D., Wang, Jingyan, Davenport, Mark A., Pananjady, Ashwin
We introduce a new type of query mechanism for collecting human feedback, called the perceptual adjustment query (PAQ). Being both informative and cognitively lightweight, the PAQ adopts an inverted measurement scheme and combines advantages from both cardinal and ordinal queries. We showcase the PAQ in the metric learning problem, where we collect PAQ measurements to learn an unknown Mahalanobis distance. This gives rise to a high-dimensional, low-rank matrix estimation problem to which standard matrix estimators cannot be applied. Consequently, we develop a two-stage estimator for metric learning from PAQs, and provide sample complexity guarantees for this estimator. We present numerical simulations demonstrating the performance of the estimator and its notable properties.
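The short sketch below illustrates only the object being learned here, a rank-r Mahalanobis metric M = L L^T and the distance it induces. The PAQ mechanism and the two-stage estimator themselves are not reproduced, and the dimensions and rank are arbitrary choices made for the example.

# Low-rank Mahalanobis metric: the unknown quantity in the metric learning problem.
# All sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, r = 20, 3                       # ambient dimension and metric rank (assumed)
L = rng.normal(size=(d, r))
M = L @ L.T                        # rank-r positive semidefinite matrix to estimate

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ M @ diff)

x, y = rng.normal(size=d), rng.normal(size=d)
print(mahalanobis_sq(x, y, M))
print(np.linalg.matrix_rank(M))    # equals r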
A Bayesian Robust Regression Method for Corrupted Data Reconstruction
Fan, Zheyi, Li, Zhaohui, Wang, Jingyan, Lin, Dennis K. J., Xiong, Xiao, Hu, Qingpei
Because of the widespread existence of noise and data corruption, recovering the true regression parameters when a certain proportion of the response variables is corrupted is an essential task. Methods to overcome this problem often involve robust least-squares regression, but few methods perform well when confronted with severe adaptive adversarial attacks. In many applications, prior knowledge is available from historical data or engineering experience, and by incorporating this prior information, we develop an effective robust regression method that can resist adaptive adversarial attacks. First, we propose the novel TRIP (hard Thresholding approach to Robust regression with sImple Prior) algorithm, which improves the breakdown point when facing adaptive adversarial attacks. Then, to improve the robustness and reduce the estimation error caused by the inclusion of priors, we use the idea of Bayesian reweighting to construct the more robust BRHT (robust Bayesian Reweighting regression via Hard Thresholding) algorithm. We prove the theoretical convergence of the proposed algorithms under mild conditions, and extensive experiments show that under different types of dataset attacks, our algorithms outperform the benchmark methods. Finally, we apply our methods to a data-recovery problem in a real-world application involving a space solar array, demonstrating their practical applicability.
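The following sketch shows a generic hard-thresholding robust regression loop in the spirit of the algorithms above, but without the prior information or the Bayesian reweighting: it alternates a least-squares fit on the responses minus the current corruption estimate with re-estimating the corruptions as the k largest residuals. The corruption budget k, the iteration count, and the data-generating process are assumptions for illustration, not the TRIP or BRHT procedures themselves.

# Generic hard-thresholding robust regression sketch (illustrative only).
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
    return out

def robust_regression_ht(X, y, k, n_iter=20):
    b = np.zeros(len(y))                                # estimated corruption vector
    for _ in range(n_iter):
        w, *_ = np.linalg.lstsq(X, y - b, rcond=None)   # fit on "cleaned" responses
        b = hard_threshold(y - X @ w, k)                # largest residuals -> corruptions
    return w

rng = np.random.default_rng(0)
n, p, k = 200, 5, 20
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)
y[:k] += 10.0                                           # corrupt k of the responses
print(np.linalg.norm(robust_regression_ht(X, y, k) - w_true))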
Debiasing Evaluations That are Biased by Evaluations
Wang, Jingyan, Stelmakh, Ivan, Wei, Yuting, Shah, Nihar B.
It is common to aggregate information and evaluate items by collecting ratings on these items from people. In this work, we focus on the bias introduced by people's observable outcome or experience from the entity under evaluation, and we call it the "outcome-induced bias". Let us describe this notion of bias with the help of two common applications - teaching evaluation and peer review. Many universities use student ratings for teaching evaluation. However, numerous studies have shown that student ratings are affected by the grading policy of the instructor [16, 26, 5]. For instance, as noted in [26, Chapter 4]: "...the effects of grades on teacher-course evaluations are both substantively and statistically important, and suggest that instructors can often double their odds of receiving high evaluations from students simply by awarding A's rather than B's or C's." As a consequence, the association between student ratings and teaching effectiveness can become negative [5], and student ratings serve as a poor predictor of the students' follow-on course achievement [8, 6]: "...teachers who are associated with better subsequent performance receive worst evaluations from their students."
Stretching the Effectiveness of MLE from Accuracy to Bias for Pairwise Comparisons
Wang, Jingyan, Shah, Nihar B., Ravi, R.
A number of applications (e.g., AI bot tournaments, sports, peer grading, crowdsourcing) use pairwise comparison data and the Bradley-Terry-Luce (BTL) model to evaluate a given collection of items (e.g., bots, teams, students, search results). Past work has shown that under the BTL model, the widely used maximum-likelihood estimator (MLE) is minimax-optimal in estimating the item parameters, in terms of the mean squared error. However, another important desideratum for designing estimators is fairness. In this work, we consider fairness modeled by the notion of bias in statistics. We show that the MLE incurs a suboptimal rate in terms of bias. We then propose a simple modification to the MLE, which "stretches" the bounding box of the maximum-likelihood optimizer by a small constant factor from the underlying ground truth domain. We show that this simple modification leads to an improved rate in bias, while maintaining minimax-optimality in the mean squared error. In this manner, our proposed class of estimators provably improves fairness represented by bias without loss in accuracy.
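The sketch below illustrates the described modification on toy data: the BTL negative log-likelihood is minimized once over the ground-truth box [-B, B] on the item parameters and once over a stretched box [-cB, cB]. The box radius B, the stretch factor c, and the comparison design are illustrative choices for the example, not the constants or settings analyzed in the paper.

# Box-constrained BTL maximum likelihood vs. a "stretched" box (illustrative sketch).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
n_items, B, c = 5, 1.0, 2.0
w_true = rng.uniform(-B, B, size=n_items)

# One comparison per ordered pair (i, j): record whether item i beat item j.
pairs = [(i, j) for i in range(n_items) for j in range(n_items) if i != j]
wins = np.array([rng.random() < expit(w_true[i] - w_true[j]) for i, j in pairs])

def neg_log_lik(w):
    diffs = np.array([w[i] - w[j] for i, j in pairs])
    p = expit(diffs)                       # BTL win probabilities
    return -np.sum(wins * np.log(p) + (1 - wins) * np.log(1 - p))

def box_mle(radius):
    res = minimize(neg_log_lik, x0=np.zeros(n_items),
                   bounds=[(-radius, radius)] * n_items)
    return res.x

w_mle = box_mle(B)            # MLE constrained to the ground-truth box
w_stretched = box_mle(c * B)  # MLE over the stretched box
print(w_mle, w_stretched)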
Your 2 is My 1, Your 3 is My 9: Handling Arbitrary Miscalibrations in Ratings
Wang, Jingyan, Shah, Nihar B.
Cardinal scores (numeric ratings) collected from people are well known to suffer from miscalibrations. A popular approach to address this issue is to assume simplistic models of miscalibration (such as linear biases) to de-bias the scores. This approach, however, often fares poorly because people's miscalibrations are typically far more complex and not well understood. In the absence of simplifying assumptions on the miscalibration, it is widely believed that the only useful information in the cardinal scores is the induced ranking. In this paper, inspired by the framework of Stein's shrinkage and empirical Bayes, we contest this widespread belief. Specifically, we consider cardinal scores with arbitrary (or even adversarially chosen) miscalibrations that are only required to be consistent with the induced ranking. We design estimators that, despite making no assumptions on the miscalibration, surprisingly, strictly and uniformly outperform all possible estimators that rely only on the ranking. Our estimators are flexible in that they can be used as a plug-in for a variety of applications. Our results thus provide novel insights into the eternal debate between cardinal and ordinal data.
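The short sketch below illustrates only the data model: reported scores pass through an arbitrary monotone miscalibration, so the induced ranking is preserved while the cardinal values themselves can be distorted at will. The particular miscalibration function is an assumption made for illustration; the paper's estimators are not reproduced here.

# Arbitrary monotone miscalibration: scores distort the true values but preserve
# the induced ranking. The transformation below is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.uniform(0, 1, size=6)

def random_monotone_miscalibration(x):
    # A strictly increasing, otherwise arbitrary transformation of the true values.
    a, b, p = rng.uniform(0.5, 3.0), rng.uniform(-1.0, 1.0), rng.uniform(0.5, 2.0)
    return a * np.power(x, p) + b

scores = random_monotone_miscalibration(true_values)
# The reported scores can be wildly miscalibrated, but the induced ranking matches.
print(np.argsort(true_values), np.argsort(scores))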
Joint-ViVo: Selecting and Weighting Visual Words Jointly for Bag-of-Features based Tissue Classification in Medical Images
Wang, Jingyan
Automatically classifying the tissue types of a Region of Interest (ROI) in medical images is an important application in Computer-Aided Diagnosis (CAD), for example classifying breast parenchymal tissue in mammograms or lung disease patterns in High-Resolution Computed Tomography (HRCT). Recently, the bag-of-features method has shown its power in this field by treating each ROI as a set of local features. In this paper, we investigate using the bag-of-features strategy to classify tissue types in medical imaging applications. Two important issues are considered here: visual vocabulary learning and weighting. Although there are already plenty of algorithms to deal with them, all of them treat the two steps independently: the vocabulary is learned first and the histogram is weighted afterwards. Inspired by Auto-Context, which learns the features and the classifier jointly, we develop a novel algorithm that learns the vocabulary and the weights jointly. The new algorithm, called Joint-ViVo, works in an iterative way. In each iteration, we first learn the weights for each visual word by maximizing the margin of ROI triplets, and then select the most discriminative visual words based on the learned weights for the next iteration. We test our algorithm on three tissue classification tasks: identifying brain tissue type in magnetic resonance imaging (MRI), classifying lung tissue in HRCT images, and classifying breast tissue density in mammograms. The results show that Joint-ViVo performs effectively for classifying tissues.
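A minimal sketch of the iterative loop described above follows: per-word weights are learned from ROI triplets with a hinge-style margin on a weighted squared distance, and the highest-weighted visual words are kept for the next iteration. The specific loss, step size, selection size, and random data are illustrative assumptions rather than the exact Joint-ViVo formulation.

# Illustrative joint selection-and-weighting loop over bag-of-features histograms.
import numpy as np

def weighted_sq_dist(w, h1, h2):
    return np.sum(w * (h1 - h2) ** 2)

def joint_vivo_sketch(hists, triplets, n_select, n_iter=5, margin=0.1, lr=0.1):
    active = np.arange(hists.shape[1])          # indices of currently kept visual words
    for _ in range(n_iter):
        H = hists[:, active]
        w = np.ones(len(active))
        for a, p, n in triplets:                # (anchor, same-class, different-class) ROIs
            loss = margin + weighted_sq_dist(w, H[a], H[p]) - weighted_sq_dist(w, H[a], H[n])
            if loss > 0:                        # hinge: push positives closer than negatives
                grad = (H[a] - H[p]) ** 2 - (H[a] - H[n]) ** 2
                w = np.maximum(w - lr * grad, 0.0)
        keep = np.argsort(w)[-n_select:]        # keep the most discriminative words
        active = active[keep]
    return active

rng = np.random.default_rng(0)
hists = rng.random((30, 100))                   # 30 ROIs, 100 visual words (toy data)
triplets = [tuple(rng.choice(30, 3, replace=False)) for _ in range(50)]
print(joint_vivo_sketch(hists, triplets, n_select=20))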