 multiple-choice question


Results on FAVOR Bench

Neural Information Processing Systems

Prompt Template: Generating QA Pairs for Camera Motion (CM) Task. You are a professional question designer focusing on temporal dynamics in videos, including camera movements, motions, activities, and interactions, rather than static content. You will receive detailed annotations about the temporal details of the entire video, with duration markers in parentheses after "camera_motion" and "motion_list". Based on these annotations, design 3 multiple-choice questions around the "Camera Motion" theme to test models' fine-grained video motion understanding, particularly understanding of camera movement direction and focus changes in the video. Additionally, follow these question design guidelines: 1. If a video's "camera_motion" has only one element, such as "camera_motion": "static" or "camera_motion": "camera shaking (0-22)", skip this video and don't generate any content.


Appendix A Task Definitions

Neural Information Processing Systems

Table 3 outlines the perception and reasoning tasks included in the MMPerspective benchmark. Sample cases and representative questions are included to illustrate the task format and input style. We also show examples of perspective-invariant image operations for robustness evaluation in Figure 17, including cropping, masking, flipping, and rotation. Where is the vanishing point in this image? Critical Line Perception (CLP) 123 Figure 9 Determine which of the highlighted lines is the horizon line. Which line highlighted in the image is the Horizon Line?


My Son's Math Homework Is Essentially Just Pokémon

The Atlantic - Technology

Education games are taking over American classrooms. One afternoon earlier this year, my 11-year-old son was sitting at his laptop and working quietly on his math homework. At least, that's what he was supposed to be doing. When I glanced at his screen, equations were nowhere to be seen. He was controlling a monster in the midst of battle, casting magic spells to outduel an opposing player.


The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only The Falcon LLM Team

Neural Information Processing Systems

This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation, and whether we will run out of unique high-quality data soon.


1bdcb065d40203a00bd39831153338bb-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. These results underscore the significance of incorporating uncertainty into the evaluation of LLMs.


Reasoning Models Ace the CFA Exams

arXiv.org Artificial Intelligence

Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.