evaluation performance




Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Zhan, Runzhe, Huang, Zhihong, Yang, Xinyi, Chao, Lidia S., Yang, Min, Wong, Derek F.

arXiv.org Artificial Intelligence

Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on the WMT24 Metrics benchmarks demonstrate that this approach reduces thinking budgets by roughly 35x while concurrently improving evaluation performance across LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
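The WMT Metrics benchmarks mentioned above score a judge by correlating its segment-level scores with human ratings. A minimal sketch of that evaluation step, assuming plain Pearson correlation; the score lists below are illustrative, not data from the paper:

```python
# Sketch: how a metrics benchmark scores an LRM judge -- correlate the
# judge's segment-level scores with human quality ratings.

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human = [80.0, 62.5, 95.0, 40.0]   # hypothetical human ratings per segment
judge = [75.0, 60.0, 90.0, 50.0]   # hypothetical LRM judge scores

print(f"correlation: {pearson(human, judge):.3f}")
```

An overestimating judge (the scoring issue the abstract describes) shifts scores upward uniformly, which Pearson correlation alone does not penalise; that is one reason calibration of the thinking process, not just rescaling, matters.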


Early Detection and Reduction of Memorisation for Domain Adaptation and Instruction Tuning

Slack, Dean L., Moubayed, Noura Al

arXiv.org Artificial Intelligence

Most defences target the pre-training stage, leaving memorisation during fine-tuning--especially for domain adaptation and instruction tuning--poorly understood. We fine-tune Pythia, Llama3, and Mistral models spanning 1.4B-70B parameters on common evaluation datasets and track verbatim memorisation throughout training. We find that memorisation increases dramatically in the first few epochs, often significantly before either validation perplexity or evaluation performance is optimised. We use a simple but effective n-gram memorisation score that reliably precedes verbatim memorisation; using it as an early-stopping criterion mitigates memorisation with minimal performance loss. Further, we introduce an n-gram-aware loss regulariser and show that it reduces memorisation across all model families tested by up to 40% while minimising evaluation performance trade-offs compared to an existing memorisation mitigation strategy. These results yield practical, scalable insights into memorisation dynamics during language model fine-tuning.
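The n-gram memorisation score described above can be sketched as a simple set-overlap statistic; the helper names, the choice of n, and the early-stopping threshold below are illustrative assumptions, not the paper's exact formulation:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_memorisation_score(generation, training_docs, n=4):
    """Fraction of the generation's n-grams found verbatim in the training data."""
    gen = ngrams(generation.split(), n)
    train = set()
    for doc in training_docs:
        train |= ngrams(doc.split(), n)
    return len(gen & train) / max(len(gen), 1)

# Hypothetical early-stopping check, run on sampled generations each epoch:
train_docs = ["the quick brown fox jumps over the lazy dog"]
sample = "the quick brown fox ran away"
score = ngram_memorisation_score(sample, train_docs, n=3)
if score > 0.3:  # illustrative threshold, not from the paper
    print(f"stop early: memorisation score {score:.2f}")
```

Because partial n-gram overlap rises before full verbatim reproduction appears, a threshold on this score can trigger early stopping before memorisation becomes verbatim.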


Supplementary Materials: Training Stronger Baselines for Learning to Optimize

Chen, Tianlong

Neural Information Processing Systems

L2O-DM-CL denotes the enhanced L2O-DM with our proposed curriculum learning technique. All learnable optimizers are trained for 5000 epochs. The results are presented in Figure A2 (evaluation performance of our enhanced L2O and previous SOTAs, in terms of log training loss); curves are the average of ten runs. We observe that the model trained by curriculum learning outperforms the two baselines (i.e., L2O-DM and L2O-DM-AUG).


Fantastic Pretraining Optimizers and Where to Find Them

Wen, Kaiyue, Hall, David, Ma, Tengyu, Liang, Percy

arXiv.org Machine Learning

AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size, down to only 1.1x for 1.2B parameter models. Third, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all of the fastest optimizers, such as Muon and Soap, use matrices as preconditioners -- multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models.
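To make the "matrices as preconditioners" distinction concrete, here is a toy NumPy contrast between an entry-wise Adam-style scaling and a Shampoo-style matrix preconditioner. This is a sketch, not any of the studied optimizers: the inverse-fourth-root construction is one common matrix-preconditioning recipe, and the second-moment estimate is a single-step stand-in for Adam's running average:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))           # gradient matrix for one layer

# Entry-wise preconditioning (Adam-style): each entry divided by its own scalar.
v = G ** 2                                # toy second-moment estimate
step_entrywise = G / (np.sqrt(v) + 1e-8)  # entries scaled independently

# Matrix preconditioning (Shampoo-style): left/right matrices from GG^T and G^TG.
def inv_fourth_root(M, eps=1e-8):
    """Symmetric inverse fourth root via eigendecomposition of a PSD matrix."""
    w, Q = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return Q @ np.diag(np.clip(w, eps, None) ** -0.25) @ Q.T

L = inv_fourth_root(G @ G.T)
R = inv_fourth_root(G.T @ G)
step_matrix = L @ G @ R                   # the whole gradient matrix is transformed

print(step_entrywise.shape, step_matrix.shape)
```

The entry-wise step can only rescale each coordinate (here it collapses to the sign of each entry), while the matrix step mixes information across rows and columns of the layer's gradient, which is the structural difference the abstract credits for the speedup.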



A Appendix Radial

Neural Information Processing Systems

Even if we overcome the integration issues, we are still faced with challenges defining the overlap term. Existing evaluation methods in previous works do not directly measure reward: they primarily focus on the agent's original actions not changing under adversarial attack, and when most actions do not change under attack, the reward is also largely unchanged. In addition, a full description of AWC (Absolute Worst-Case Reward) is provided in Algorithm 2. SA-DQN requires this to be 1, which can cause issues when the natural Q-values differ by less than 1. We tested this new loss on BankHeist and RoadRunner for Atari. Full results are summarized in Table 4 below.



Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text

Zhang, Junzhe, Zhang, Huixuan, Hu, Xinyu, Lin, Li, Gao, Mingqi, Qiu, Shi, Wan, Xiaojun

arXiv.org Artificial Intelligence

Evaluation is important for multimodal generation tasks. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing work overlooks two aspects: (1) the development of evaluation capabilities for the text-to-image (T2I) generation task, and (2) the incorporation of large-scale human evaluation data. In this paper, we introduce Minos-Corpus, a large-scale multimodal evaluation dataset that combines evaluation data from both humans and GPT. The corpus contains evaluation data across both image-to-text (I2T) and T2I generation tasks. Based on this corpus, we propose Data Selection and Balance and Mix-SFT training methods, and apply DPO to develop Minos, a multimodal evaluation model built upon a 7B backbone. Minos achieves state-of-the-art (SoTA) performance among all open-source evaluation models of similar scale in average evaluation performance across all tasks, and outperforms all open-source and closed-source models on evaluation of the T2I generation task. Extensive experiments demonstrate the importance of leveraging high-quality human evaluation data and of jointly training on evaluation data from both I2T and T2I generation tasks.