
Collaborating Authors

test taker


Fantastic Bugs and Where to Find Them in AI Benchmarks

Truong, Sang, Tu, Yuheng, Hardy, Michael, Reuel, Anka, Tang, Zeyu, Burapacheep, Jirayu, Perera, Jonathan, Uwakwe, Chibuike, Domingue, Ben, Haber, Nick, Koyejo, Sanmi

arXiv.org Artificial Intelligence

Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
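The flagging idea in this abstract — items whose statistics fall outside the range implied by a unidimensional construct — can be illustrated with a classical item statistic. The sketch below is not the authors' code; the function name, the binary score matrix, and the zero threshold are illustrative assumptions. It computes each item's corrected item-total (rest-score) correlation across models and flags items that discriminate negatively, i.e., items that stronger models get wrong more often.

```python
import numpy as np

def flag_items(responses, threshold=0.0):
    """Flag items whose corrected item-total correlation falls below
    `threshold`; negative discrimination suggests an invalid item.

    responses: (n_models, n_items) binary 0/1 score matrix.
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    flags = []
    for j in range(n_items):
        rest = np.delete(responses, j, axis=1).mean(axis=1)  # rest-score
        item = responses[:, j]
        if item.std() == 0 or rest.std() == 0:
            flags.append(j)   # zero variance: item carries no information
            continue
        r = np.corrcoef(item, rest)[0, 1]
        if r < threshold:
            flags.append(j)   # misaligned with the latent trait
    return flags
```

On a toy matrix where three items follow a clean ability ordering and a fourth anticorrelates with it, only the fourth item is flagged; in practice the flagged set would go to expert review rather than being dropped automatically.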


Automatic Detection of Inauthentic Templated Responses in English Language Assessments

Samant, Yashad, Becker, Lee, Hellman, Scott, Behan, Bradley, Hughes, Sarah, Southerland, Joshua

arXiv.org Artificial Intelligence

Pearson Education, Inc. In this study, we introduce the automated detection of inauthentic, templated responses (AuDITR) task, describe a machine learning-based approach to this task, and illustrate the importance of regularly updating these models in production. English language proficiency (ELP) tests carry exceptionally high stakes because of how they influence access to employment, education, and national residency status.


The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment

Fokin, Danil, Płużyczka, Monika, Golovin, Grigory

arXiv.org Artificial Intelligence

We present the Polish Vocabulary Size Test (PVST), a novel tool for assessing the receptive vocabulary size of both native and non-native Polish speakers. Based on Item Response Theory and Computerized Adaptive Testing, PVST dynamically adjusts to each test-taker's proficiency level, ensuring high accuracy while keeping the test duration short. To validate the test, a pilot study was conducted with 1,475 participants. Native Polish speakers demonstrated significantly larger vocabularies compared to non-native speakers. For native speakers, vocabulary size showed a strong positive correlation with age. The PVST is available online at myvocab.info/pl.
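Computerized adaptive testing of the kind PVST describes re-estimates the test-taker's ability after every response. A minimal sketch of that update step under a Rasch (1PL) model — illustrative only, not the PVST implementation; function names and the fixed iteration count are assumptions — uses Newton-Raphson on the log-likelihood of the responses observed so far:

```python
import numpy as np

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def update_theta(theta, b, y, steps=10):
    """Newton-Raphson MLE of ability `theta` given administered item
    difficulties `b` and 0/1 responses `y`. Assumes a mixed response
    pattern (all-correct or all-wrong patterns have no finite MLE)."""
    b = np.asarray(b, dtype=float)
    y = np.asarray(y, dtype=float)
    for _ in range(steps):
        p = rasch_p(theta, b)
        grad = np.sum(y - p)          # d log-likelihood / d theta
        hess = -np.sum(p * (1 - p))   # second derivative, always negative
        theta -= grad / hess          # Newton step toward the maximum
    return theta
```

For example, answering items of difficulty -1 and 0 correctly but missing a difficulty-1 item places the ability estimate a little above the middle item, around 0.8 on the logit scale.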


Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

Säuberli, Andreas, Frassinelli, Diego, Plank, Barbara

arXiv.org Artificial Intelligence

Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
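The calibration step mentioned here, temperature scaling, divides a model's logits by a scalar T > 1 before the softmax, flattening an overconfident response distribution without changing which option is ranked first. A minimal sketch (function names are illustrative, not the paper's code):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def temperature_scale(logits, T):
    """Soften (T > 1) or sharpen (T < 1) a logit vector; T = 1 is identity."""
    return softmax(np.asarray(logits, dtype=float) / T)
```

Because scaling by T preserves the ordering of the logits, the argmax answer is unchanged; only the confidence spread moves toward (or away from) uniform, which is what makes the scaled distributions comparable to human response distributions.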


After exam fiasco, California State Bar faces deeper financial crisis

Los Angeles Times

The California State Bar's botched rollout of a new exam -- a move that the cash-strapped agency made in the hopes of saving money -- could ultimately end up costing it an additional $5.6 million. Leah T. Wilson, executive director of the State Bar, told state lawmakers at a Senate Judiciary hearing Tuesday that the agency expects to pay around $3 million to offer free exams to test takers, an additional $2 million to book in-person testing sites in July, and $620,000 to return the test to its traditional system of multiple-choice questions in July. Wilson, who announced last week she will step down when her term ends this summer, revealed the costs during a 90-minute hearing called by Sen. Thomas J. Umberg (D-Orange), chair of the Senate Judiciary Committee, to find out what went so "spectacularly wrong." Chaos ensued in February when thousands of test takers seeking to practice law in California sat for the new exam. Some reported they couldn't log into the exam because online testing platforms repeatedly crashed.


Head of State Bar of California to step down after exam fiasco

Los Angeles Times

The State Bar of California announced Friday that its embattled leader, who has faced growing pressure to resign over the botched February rollout of a new bar exam, will step down in July. Leah T. Wilson, the agency's executive director, informed the Board of Trustees she will not seek another term in the position she has held on and off since 2017. She also apologized for her role in the February bar exam chaos. "Accountability is a bedrock principle for any leader," Wilson said in a statement. "At the end of the day, I am responsible for everything that occurs within the organization. Despite our best intentions, the experiences of applicants for the February Bar Exam simply were unacceptable, and I fully recognize the frustration and stress this experience caused. While there are no words to assuage those emotions, I do sincerely apologize."


Pressure grows on State Bar of California to revert to national exam format in July after botched exam

Los Angeles Times

An influential California legislator is pressuring the State Bar of California to ditch its new multiple-choice questions after a February bar exam debacle and revert to the traditional test format in July. "Given the catastrophe of the February bar, I think that going back to the methods that have been used for the last 50 years -- until we can adequately test what new methods may be employed -- is the appropriate way to go," Sen. Tom Umberg (D-Orange), chair of the state Senate Judiciary Committee, told The Times. Thousands of test takers seeking to practice law in California typically take the two-day bar exam in July. Reverting to the national system by the National Conference of Bar Examiners, which California has used since 1972, would be a major retreat for the embattled State Bar. Its new exam was rolled out this year as a cost-cutting measure and "historic agreement" that would offer test takers the choice of remote testing.


State Bar of California admits it used AI to develop exam questions, triggering new furor

Los Angeles Times

Nearly two months after hundreds of prospective California lawyers complained that their bar exams were plagued with technical problems and irregularities, the state's legal licensing body has caused fresh outrage by admitting that some multiple-choice questions were developed with the aid of artificial intelligence. The State Bar of California said in a news release Monday that it will ask the California Supreme Court to adjust test scores for those who took its February bar exam. But it declined to acknowledge significant problems with its multiple-choice questions -- even as it revealed that a subset of questions were recycled from a first-year law student exam, while others were developed with the assistance of AI by ACS Ventures, the State Bar's independent psychometrician. "The debacle that was the February 2025 bar exam is worse than we imagined," said Mary Basick, assistant dean of academic skills at UC Irvine Law School. Having the questions drafted by non-lawyers using ...


Reliable and Efficient Amortized Model-based Evaluation

Truong, Sang, Tu, Yuheng, Liang, Percy, Li, Bo, Koyejo, Sanmi

arXiv.org Artificial Intelligence

Comprehensive evaluations of language models (LMs) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnostics) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice. Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluations impractical. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach, unfortunately, often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset. Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by carefully controlling for question difficulty. Unfortunately, question difficulty is expensive to estimate. Facing this challenge, we train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost. In addition, we leverage this difficulty predictor to further improve the evaluation efficiency by training a question generator conditioned on a difficulty level. This question generator is essential in adaptive testing, where, instead of using a random subset of the benchmark questions, informative questions are adaptively chosen based on the current estimation of LLM performance. Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient compared to current common practice.
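The adaptive selection step this abstract describes chooses, at each round, the question that is most informative at the current ability estimate. Under a two-parameter logistic (2PL) IRT model, item j's Fisher information at ability theta is a_j^2 * p_j * (1 - p_j). A minimal sketch of that selection rule (illustrative, not the paper's implementation; names are assumptions):

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def next_item(theta, a, b, asked):
    """Return the index of the unasked item with maximal Fisher
    information I_j(theta) = a_j^2 * p_j * (1 - p_j) at `theta`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    p = p_correct(theta, a, b)
    info = a**2 * p * (1 - p)
    info[list(asked)] = -np.inf   # exclude already-administered items
    return int(np.argmax(info))
```

Information peaks where p = 0.5, so with equal discriminations the rule picks the item whose difficulty is closest to the current ability estimate, which is why a difficulty-conditioned question generator slots naturally into this loop.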


Test Security in Remote Testing Age: Perspectives from Process Data Analytics and AI

Hao, Jiangang, Fauss, Michael

arXiv.org Artificial Intelligence

The COVID-19 pandemic has accelerated the implementation and acceptance of remotely proctored high-stakes assessments. While the flexible administration of the tests brings forth many values, it raises test security-related concerns. Meanwhile, artificial intelligence (AI) has witnessed tremendous advances in the last five years. Many AI tools (such as the very recent ChatGPT) can generate high-quality responses to test items. These new developments require test security research beyond the statistical analysis of scores and response time. Data analytics and AI methods based on clickstream process data can give us deeper insight into the test-taking process and hold great promise for securing remotely administered high-stakes tests. This chapter uses real-world examples to show that this is indeed the case.