 difficulty rating


Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Neural Information Processing Systems

Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions.



FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

arXiv.org Artificial Intelligence

Recent AI systems have demonstrated remarkable proficiency in tackling challenging mathematical tasks, from achieving olympiad-level performance in geometry (Trinh et al. 2024) to improving upon existing research results in combinatorics (Romera-Paredes et al. 2024). However, existing benchmarks face some limitations. Saturation of existing benchmarks: Current standard mathematics benchmarks such as the MATH dataset (Hendrycks, Burns, Kadavath, et al. 2021) and GSM8K (Cobbe et al. 2021) primarily assess competency at the high-school and early undergraduate level. As state-of-the-art models achieve near-perfect performance on these benchmarks, we lack rigorous ways to evaluate their capabilities in advanced mathematical domains that require deeper theoretical understanding, creative insight, and specialized expertise. Furthermore, to assess AI's potential contributions to mathematics research, we require benchmarks that better reflect the challenges faced by working mathematicians. Benchmark contamination in training data: A significant challenge in evaluating large language models (LLMs) is data contamination, the inadvertent inclusion of benchmark problems in training data.


Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

arXiv.org Artificial Intelligence

While easy-to-hard generalization across tasks is crucial for profiling large language models (LLMs), datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still lacking. Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with a numerical difficulty score. To systematically estimate problem difficulties, we collect abundant performance data on attempts at each problem by humans in the real world or by LLMs on a prominent leaderboard. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research in LLM generalization. The datasets are available at https://huggingface.co/datasets/furonghuang-lab/Easy2Hard-Bench.
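
As a rough illustration of how an IRT-style model can turn pass/fail attempt data into numerical difficulty scores, the sketch below fits a one-parameter (Rasch) model by gradient ascent. It is not the Easy2Hard-Bench pipeline: the function name `fit_rasch`, the ridge penalty `l2`, the learning rate, and the toy responses are all assumptions made for this example.

```python
import math

def fit_rasch(responses, n_iters=500, lr=0.1, l2=0.01):
    """Fit a 1PL (Rasch) model: P(correct) = sigmoid(ability - difficulty).

    responses: list of (solver_id, problem_id, correct) with correct in {0, 1}.
    The small ridge penalty l2 is an assumption of this sketch; it keeps the
    estimates finite when a solver gets everything right or everything wrong.
    """
    ability = {s: 0.0 for s, _, _ in responses}
    difficulty = {p: 0.0 for _, p, _ in responses}
    for _ in range(n_iters):
        # Gradient of the penalized log-likelihood for each parameter.
        grad_a = {s: -l2 * a for s, a in ability.items()}
        grad_d = {p: -l2 * d for p, d in difficulty.items()}
        for s, p, y in responses:
            prob = 1.0 / (1.0 + math.exp(-(ability[s] - difficulty[p])))
            grad_a[s] += y - prob      # a success pushes ability up
            grad_d[p] -= y - prob      # a success pushes difficulty down
        for s in ability:
            ability[s] += lr * grad_a[s]
        for p in difficulty:
            difficulty[p] += lr * grad_d[p]
    return ability, difficulty

# Hypothetical toy data: three solvers attempting two problems.
toy_responses = [
    ("a", "easy", 1), ("a", "hard", 0),
    ("b", "easy", 1), ("b", "hard", 1),
    ("c", "easy", 0), ("c", "hard", 0),
]
abilities, difficulties = fit_rasch(toy_responses)
print(sorted(difficulties.items(), key=lambda kv: kv[1]))
```

On this toy data the problem solved by fewer solvers receives the higher difficulty score. Rating systems such as Glicko-2 additionally track the uncertainty of each rating, which this sketch omits.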


Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities

arXiv.org Artificial Intelligence

Text simplification refers to the process of increasing the comprehensibility of texts. Automatic text simplification models are most commonly evaluated by experts or crowdworkers instead of the primary target groups of simplified texts, such as persons with intellectual disabilities. We conducted an evaluation study of text comprehensibility including participants with and without intellectual disabilities reading unsimplified, automatically and manually simplified German texts on a tablet computer. We explored four different approaches to measuring comprehensibility: multiple-choice comprehension questions, perceived difficulty ratings, response time, and reading speed. The results revealed significant variations in these measurements, depending on the reader group and whether the text had undergone automatic or manual simplification. For the target group of persons with intellectual disabilities, comprehension questions emerged as the most reliable measure, while analyzing reading speed provided valuable insights into participants' reading behavior.


What might matter in autonomous cars adoption: first person versus third person scenarios

arXiv.org Artificial Intelligence

The discussion between the automotive industry, governments, ethicists, policy makers and the general public about autonomous cars' moral agency is widening, and therefore we see the need to bring more insight into what meta-factors might actually influence the outcomes of such discussions, surveys and plebiscites. In our study, we focus on the psychological (personality traits), practical (active driving experience), gender and rhetoric/framing factors that might impact and even determine respondents' a priori preferences regarding autonomous cars' operation. We conducted an online survey (N=430) to collect data that show that the third person scenario is less biased than the first person scenario when presenting ethical dilemmas related to autonomous cars. According to our analysis, gender bias should be explored in more extensive future studies as well. We recommend that any participatory technology assessment discourse use the third person scenario and direct attention to the way any debate related to autonomous cars is introduced, especially in terms of linguistic and communication aspects and gender.


Difficulty Rating of Sudoku Puzzles: An Overview and Evaluation

arXiv.org Artificial Intelligence

How can we predict the difficulty of a Sudoku puzzle? We give an overview of difficulty rating metrics and evaluate them on an extensive dataset on human problem solving (more than 1700 Sudoku puzzles, hundreds of solvers). The best results are obtained using a computational model of human solving activity. Using the model we show that there are two sources of the problem difficulty: complexity of individual steps (logic operations) and structure of dependency among steps. We also describe metrics based on analysis of solutions under relaxed constraints, a novel approach inspired by the phase transition phenomenon in the graph coloring problem. In our discussion we focus not just on the performance of individual metrics on the Sudoku puzzle, but also on their generalizability and applicability to other problems.


Difficulty Rating of Sudoku Puzzles by a Computational Model

AAAI Conferences

We discuss and evaluate metrics for difficulty rating of Sudoku puzzles. The correlation coefficient with human performance for our best metric is 0.95. The data on human performance were obtained from three web portals and they comprise thousands of hours of human solving over 2000 problems. We provide a simple computational model of human solving activity and evaluate it over the collected data. Using the model we show that there are two sources of problem difficulty: complexity of individual steps (logic operations) and structure of dependency among steps. Besides providing a very good Sudoku-tuned metric, we also discuss a metric with few Sudoku-specific details, which still provides good results (correlation coefficient is 0.88). Hence we believe that the approach should be applicable to difficulty rating of other constraint satisfaction problems.
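
To make the evaluation criterion concrete, the sketch below computes a correlation coefficient between a difficulty metric and a measure of human performance. It assumes the Pearson coefficient, which the abstract does not specify, and the function name `pearson` and the numbers are hypothetical placeholders, not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: a metric's rating per puzzle and the mean human
# solving time (in minutes) for the same puzzles.
metric_scores = [1.2, 2.5, 3.1, 4.8, 5.0]
mean_solve_minutes = [4.0, 7.5, 9.0, 15.0, 14.0]
print(round(pearson(metric_scores, mean_solve_minutes), 3))
```

A value close to 1 means the metric ranks puzzles much as human solving effort does; a rank-based coefficient such as Spearman's would be a reasonable alternative if only the ordering matters.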