AITopics | difficulty rating

difficulty rating

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Neural Information Processing Systems, Feb-12-2026, 20:57:21 GMT

Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Country:

North America > United States > Maryland (0.04)
North America > United States > California (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.92)
Questionnaire & Opinion Survey (0.67)

Industry:

Law (1.00)
Education (1.00)
Information Technology > Security & Privacy (0.93)
(2 more...)

Technology:

Information Technology > Software (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Neural Information Processing Systems, Oct-10-2025, 01:59:31 GMT

arxiv preprint arxiv, dataset, e2h-gsm8k, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Maryland (0.04)
North America > United States > California (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.92)
Questionnaire & Opinion Survey (0.67)

Industry:

Education (1.00)
Information Technology > Security & Privacy (0.93)
Law > Statutes (0.67)
(3 more...)

Technology:

Information Technology > Software (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Glazer, Elliot, Erdil, Ege, Besiroglu, Tamay, Chicharro, Diego, Chen, Evan, Gunning, Alex, Olsson, Caroline Falkman, Denain, Jean-Stanislas, Ho, Anson, Santos, Emily de Oliveira, Järviniemi, Olli, Barnett, Matthew, Sandler, Robert, Vrzala, Matej, Sevilla, Jaime, Ren, Qiuyu, Pratt, Elizabeth, Levine, Lionel, Barkley, Grant, Stewart, Natalie, Grechuk, Bogdan, Grechuk, Tetiana, Enugandla, Shreepranav Varma, Wildon, Mark

arXiv.org Artificial Intelligence, Dec-19-2024

Recent AI systems have demonstrated remarkable proficiency in tackling challenging mathematical tasks, from achieving olympiad-level performance in geometry (Trinh et al. 2024) to improving upon existing research results in combinatorics (Romera-Paredes et al. 2024). However, existing benchmarks face some limitations:

Saturation of existing benchmarks. Current standard mathematics benchmarks such as the MATH dataset (Hendrycks, Burns, Kadavath, et al. 2021) and GSM8K (Cobbe et al. 2021) primarily assess competency at the high-school and early undergraduate level. As state-of-the-art models achieve near-perfect performance on these benchmarks, we lack rigorous ways to evaluate their capabilities in advanced mathematical domains that require deeper theoretical understanding, creative insight, and specialized expertise. Furthermore, to assess AI's potential contributions to mathematics research, we require benchmarks that better reflect the challenges faced by working mathematicians.

Benchmark contamination in training data. A significant challenge in evaluating large language models (LLMs) is data contamination--the inadvertent inclusion of benchmark problems in training data.

benchmark, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2411.04872

Country:

Europe > Switzerland > Basel-City > Basel (0.04)
Europe > United Kingdom > England > Leicestershire > Leicester (0.04)

Genre: Research Report > Promising Solution (0.48)

Industry: Education > Educational Setting (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Ding, Mucong, Deng, Chenghao, Choo, Jocelyn, Wu, Zichu, Agrawal, Aakriti, Schwarzschild, Avi, Zhou, Tianyi, Goldstein, Tom, Langford, John, Anandkumar, Anima, Huang, Furong

arXiv.org Artificial Intelligence, Sep-26-2024

While generalization over tasks from easy to hard is crucial for profiling large language models (LLMs), datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still lacking. Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with numerical difficulty scores. To systematically estimate problem difficulties, we collect abundant performance data on attempts at each problem by humans in the real world or LLMs on the prominent leaderboard. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research in LLM generalization. The datasets are available at https://huggingface.co/datasets/furonghuang-lab/Easy2Hard-Bench.
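As a rough illustration of how an IRT-style difficulty score can be derived from attempt data, the sketch below fits a one-parameter (Rasch) model by plain gradient ascent. It is not the Easy2Hard-Bench authors' implementation: the function name `fit_rasch`, the learning rate, the step count, and the toy response matrix are all assumptions made for this example, and the abstract does not say which IRT variant the authors use.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, steps=500, lr=0.05):
    """Fit a Rasch (1PL IRT) model by gradient ascent on the log-likelihood.

    responses[i][j] is 1 if solver i solved problem j, else 0.
    Model assumption: P(solved) = sigmoid(ability[i] - difficulty[j]).
    Returns (ability, difficulty); higher difficulty means a harder problem.
    """
    n_solvers = len(responses)
    n_items = len(responses[0])
    ability = [0.0] * n_solvers
    difficulty = [0.0] * n_items
    for _ in range(steps):
        grad_a = [0.0] * n_solvers
        grad_d = [0.0] * n_items
        for i in range(n_solvers):
            for j in range(n_items):
                err = responses[i][j] - sigmoid(ability[i] - difficulty[j])
                grad_a[i] += err   # d(log-likelihood) / d(ability[i])
                grad_d[j] -= err   # d(log-likelihood) / d(difficulty[j])
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
        # The model only depends on ability - difficulty, so shift both by the
        # same amount to pin the mean difficulty at zero.
        mean_d = sum(difficulty) / n_items
        ability = [a - mean_d for a in ability]
        difficulty = [d - mean_d for d in difficulty]
    return ability, difficulty


# Hypothetical attempt data: 4 solvers x 3 problems (not from the paper).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
ability, difficulty = fit_rasch(responses)
print([round(d, 2) for d in difficulty])
```

In this toy matrix, problem 0 is solved most often and problem 2 least often, so the fitted difficulties increase from problem 0 to problem 2. A rating-based system such as Glicko-2 would instead update scores attempt by attempt.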

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2409.18433

Country:

North America > United States > Maryland (0.04)
North America > United States > California (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Law (1.00)
Education (1.00)
Information Technology > Security & Privacy (0.93)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities

Säuberli, Andreas, Holzknecht, Franz, Haller, Patrick, Deilen, Silvana, Schiffl, Laura, Hansen-Schirra, Silvia, Ebling, Sarah

arXiv.org Artificial Intelligence, Feb-20-2024

Text simplification refers to the process of increasing the comprehensibility of texts. Automatic text simplification models are most commonly evaluated by experts or crowdworkers instead of the primary target groups of simplified texts, such as persons with intellectual disabilities. We conducted an evaluation study of text comprehensibility including participants with and without intellectual disabilities reading unsimplified, automatically and manually simplified German texts on a tablet computer. We explored four different approaches to measuring comprehensibility: multiple-choice comprehension questions, perceived difficulty ratings, response time, and reading speed. The results revealed significant variations in these measurements, depending on the reader group and whether the text had undergone automatic or manual simplification. For the target group of persons with intellectual disabilities, comprehension questions emerged as the most reliable measure, while analyzing reading speed provided valuable insights into participants' reading behavior.

participant, simplification, target group, (12 more...)

arXiv.org Artificial Intelligence

2402.13094

Country:

Europe > Switzerland > Zürich > Zürich (0.15)
North America > United States > New York > New York County > New York City (0.14)
North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
(12 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.93)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Cognitive Science (0.68)

What might matter in autonomous cars adoption: first person versus third person scenarios

Zackova, Eva, Romportl, Jan

arXiv.org Artificial Intelligence, Oct-17-2018

The discussion among the automotive industry, governments, ethicists, policy makers and the general public about autonomous cars' moral agency is widening, and therefore we see the need to bring more insight into what meta-factors might actually influence the outcomes of such discussions, surveys and plebiscites. In our study, we focus on the psychological (personality traits), practical (active driving experience), gender and rhetoric/framing factors that might impact and even determine respondents' a priori preferences regarding autonomous cars' operation. We conducted an online survey (N=430) to collect data that show that the third person scenario is less biased than the first person scenario when presenting ethical dilemmas related to autonomous cars. According to our analysis, gender bias should be explored in more extensive future studies as well. We recommend that any participatory technology assessment discourse use the third person scenario and direct attention to the way any autonomous-car-related debate is introduced, especially in terms of linguistic and communication aspects and gender.

artificial intelligence, respondent, scenario, (15 more...)

arXiv.org Artificial Intelligence

1810.0746

Country:

North America > United States (1.00)
Europe (0.69)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)
Research Report > Experimental Study (0.95)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Information Technology > Robotics & Automation (1.00)
Automobiles & Trucks (1.00)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)

Difficulty Rating of Sudoku Puzzles: An Overview and Evaluation

Pelánek, Radek

arXiv.org Artificial Intelligence, Mar-28-2014

How can we predict the difficulty of a Sudoku puzzle? We give an overview of difficulty rating metrics and evaluate them on an extensive dataset on human problem solving (more than 1700 Sudoku puzzles, hundreds of solvers). The best results are obtained using a computational model of human solving activity. Using the model we show that there are two sources of problem difficulty: complexity of individual steps (logic operations) and structure of dependency among steps. We also describe metrics based on analysis of solutions under relaxed constraints -- a novel approach inspired by the phase transition phenomenon in the graph coloring problem. In our discussion we focus not just on the performance of individual metrics on the Sudoku puzzle, but also on their generalizability and applicability to other problems.

evolutionary algorithm, machine learning, puzzle, (16 more...)

arXiv.org Artificial Intelligence

1403.7373

Genre:

Research Report > Promising Solution (0.48)
Research Report > Experimental Study (0.46)

Industry: Leisure & Entertainment > Games > Sudoku (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Difficulty Rating of Sudoku Puzzles by a Computational Model

Pelánek, Radek (Masaryk University Brno)

AAAI Conferences, May-18-2011

We discuss and evaluate metrics for difficulty rating of Sudoku puzzles. The correlation coefficient with human performance for our best metric is 0.95. The data on human performance were obtained from three web portals and they comprise thousands of hours of human solving over 2000 problems. We provide a simple computational model of human solving activity and evaluate it on the collected data. Using the model we show that there are two sources of problem difficulty: complexity of individual steps (logic operations) and structure of dependency among steps. Besides providing a very good Sudoku-tuned metric, we also discuss a metric with few Sudoku-specific details, which still provides good results (correlation coefficient is 0.88). Hence we believe that the approach should be applicable to difficulty rating of other constraint satisfaction problems.
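The evaluation described above scores a difficulty metric by its correlation with human performance. A minimal sketch of that kind of check follows, assuming the Pearson correlation coefficient (the abstract does not say which coefficient is used) and entirely hypothetical numbers; `metric_scores` and `mean_solve_minutes` are illustrative names, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five puzzles: a metric's predicted difficulty and
# the mean human solving time in minutes.
metric_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_solve_minutes = [4.0, 7.0, 9.0, 15.0, 18.0]

r = pearson(metric_scores, mean_solve_minutes)
print(f"correlation with human performance: {r:.2f}")
```

A value close to 1 means the metric ranks and spaces puzzles much as human solving times do. A rank-based coefficient such as Spearman's would be an alternative if only the ordering matters.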

difficulty rating, puzzle, sudoku puzzle, (13 more...)

AAAI Conferences

Twenty-Fourth International FLAIRS Conference

Country:

Europe > Czechia > South Moravian Region > Brno (0.04)
Asia > Vietnam > Hanoi > Hanoi (0.04)

Genre: Workflow (0.88)

Industry: Leisure & Entertainment > Games > Sudoku (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
