AITopics | evaluation result

CLAWS: Creativity detection for LLM-generated solutions using Attention Window of Sections

Neural Information Processing Systems | Jun-22-2026, 15:48:32 GMT

Recent advances in enhancing the reasoning ability of Large Language Models (LLMs) have been remarkably successful. LLMs trained with Reinforcement Learning (RL) for reasoning demonstrate strong performance in challenging tasks such as mathematics and coding, even with relatively small model sizes. However, despite these impressive improvements in task accuracy, the assessment of creativity in LLM generations has been largely overlooked in reasoning tasks, in contrast to writing tasks. The lack of research on creativity assessment in reasoning primarily stems from two challenges: (1) the difficulty of defining the range of creativity, and (2) the necessity of human evaluation in the assessment process. To address these challenges, we propose CLAWS, a novel method that defines and classifies mathematical solutions into Typical, Creative, and Hallucinated categories without human evaluation, by leveraging attention weights across prompt sections and output. CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, and Attention Score) on five 7-8B math RL models (DeepSeek, Qwen, Mathstral, OpenMath2, and Oreal). We validate CLAWS on 4,545 math problems collected from 181 math contests (A(J)HSME, AMC, AIME). Our code is available at https://github.com/kkt94/CLAWS.
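
Two of the white-box baselines named in the abstract, Perplexity and Logit Entropy, have common textbook definitions. The sketch below follows those textbook definitions only; the abstract does not say how the paper computes them, and the function names and the list-based inputs are illustrative assumptions, not taken from the CLAWS repository.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a generation: exp of the mean negative log-likelihood.

    token_logprobs: natural-log probabilities the model assigned to each
    generated token (hypothetical input format).
    """
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def softmax(logits):
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_entropy(step_logits):
    """Mean Shannon entropy (in nats) of the next-token distribution.

    step_logits: one logit vector per generation step (hypothetical input
    format). Higher values mean the model was less certain on average.
    """
    entropies = []
    for logits in step_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)
```

Both scores use only the model's output distribution, which is why they count as white-box signals that need no human evaluation.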

large language model, machine learning, natural language, (18 more...)

Country: Asia (0.27)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

1 Supplementary Material

Neural Information Processing Systems | Jun-17-2026, 23:11:01 GMT

To investigate this further, we first observe that Claude-3.7-Sonnet … Figure 1 shows the average pass rate under budgets of 12,000, 14,000, 16,000, and 17,000 tokens. As the data demonstrate, enlarging the thinking budget yields no appreciable improvement in performance. This finding underscores the challenging nature of ENGDESIGN and suggests its value as a rigorous testbed for future efforts to enhance LLMs' engineering design proficiency.

Figure 1: Average pass rate (%) of Claude-3.7-Thinking

artificial intelligence, large language model, natural language, (18 more...)

Genre: Research Report > New Finding (0.49)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

59d2eaa5842fa641ff9b8e4c7ff0f6ee-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Jun-17-2026, 12:19:52 GMT

While text-to-image models like GPT-4o-Image and FLUX are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges are frequently evaluated inadequately with respect to their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-BENCH, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across six key perspectives: alignment, safety, image quality, bias, composition, and visualization. Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs, and closed-source VLMs on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming other judges on average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Further studies in feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language than numerical scales. Notably, human evaluations on end-to-end fine-tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-BENCH.

large language model, machine learning, natural language, (19 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (0.93)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Parameters

Neural Information Processing Systems | Jun-15-2026, 21:53:10 GMT

The … symbol represents prompt-based methods, … indicates adapter-based methods, … denotes partially fine-tuned methods, and de…

… source domain data may lead to overfitting and poor generalization to unseen domains.

artificial intelligence, machine learning, natural language, (15 more...)

Country: Europe (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector

Neural Information Processing Systems | Jun-14-2026, 13:35:27 GMT

LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types (verbosity, position, bandwagon, and sentiment), evaluated using 8 LLM evaluators, demonstrate RBD's strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively.

large language model, machine learning, natural language, (21 more...)

Country:

North America > United States (0.46)
Europe (0.28)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.92)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Visual Programming for Text to Image Generation and Evaluation

Neural Information Processing Systems | Apr-25-2026, 02:56:31 GMT

As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGEN, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation) by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task.

large language model, machine learning, natural language, (19 more...)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Discussion of Evaluation Methodologies

Neural Information Processing Systems | Apr-25-2026, 01:14:47 GMT

In previous research, there are plenty of arguments about textual backdoor evaluation, including diverse metrics and experiment settings. These valuable discussions motivate us to construct a rigorous benchmark and we highly appreciate their efforts. In this section, we briefly summarize existing opinions and provide a more detailed discussion on this topic. Table 9 summarizes the attackers OpenBackdoor implements.

Effectiveness. Besides the mainstream ASR (also called LFR [20]) and CACC metrics, there are also other effectiveness metrics. Shen et al. [46] proposed to count the number of inserted triggers that can successfully flip the label. However, although inserting more triggers could benefit attack strength, the triggers also corrupt the sentences gradually, so it is also possible that the poisoned samples become "adversarial", and we can hardly distinguish the two cases. Shen et al. [45] also mentioned this issue, and they advised calculating the ASR difference between a poisoned model and a clean model as an effectiveness metric.
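
The effectiveness metrics mentioned above can be sketched in a few lines. This is a minimal illustration under the usual reading of the acronyms (ASR as the fraction of trigger-carrying inputs classified as the attacker's target label, CACC as accuracy on clean inputs); the function names and list-based inputs are assumptions made for this example and are not the OpenBackdoor API.

```python
def clean_accuracy(preds, labels):
    """CACC: fraction of clean test inputs the model classifies correctly."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

def attack_success_rate(preds_on_triggered, target_label):
    """ASR: fraction of trigger-carrying inputs predicted as the target label.

    Assumes the triggered inputs were built from samples whose true label
    is not the target label.
    """
    hits = sum(p == target_label for p in preds_on_triggered)
    return hits / len(preds_on_triggered)

def asr_difference(poisoned_model_preds, clean_model_preds, target_label):
    """ASR of the poisoned model minus ASR of a clean model on the same inputs.

    A clean model that is also flipped by the triggers suggests the samples
    are merely adversarial, so subtracting its ASR isolates the backdoor effect.
    """
    return (attack_success_rate(poisoned_model_preds, target_label)
            - attack_success_rate(clean_model_preds, target_label))
```

For example, if the poisoned model maps 3 of 4 triggered inputs to the target label while a clean model does so for 1 of 4, the ASR difference is 0.5, which is lower than the raw ASR of 0.75 because some flips are attributable to sentence corruption alone.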

artificial intelligence, machine learning, natural language, (18 more...)

Industry: