AITopics | Instructional Material

To address this need, we contribute Holistic Evaluation of Multimodal Models (HEMM), visualized in Figure 1. HEMM, as an evaluation framework, goes beyond conventional lists of datasets to emphasize holistic benchmarking at three levels.

access restriction, arxiv preprint arxiv, dataset, (13 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > New York (0.04)
(5 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)
Instructional Material (0.67)

Industry:

Media > Film (1.00)
Information Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.93)
(5 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(6 more...)

Beyond Task Diversity: Provable Representation Transfer for Sequential Multi-Task Linear Bandits

Neural Information Processing Systems | Oct-10-2025, 00:37:15 GMT

Current literature typically assumes that the tasks are diverse, i.e., their parameters uniformly span the

algorithm, assumption, task diversity assumption, (16 more...)

Country:

North America > United States > Arizona (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Illinois (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Experimental Study (1.00)
Instructional Material (0.67)

Industry: Education > Educational Setting (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Suk, Joe, Duan, Yaqi

arXiv.org Machine Learning | Oct-10-2025

Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
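
As a rough illustration of the kind of controlled bandit setting the abstract mentions, here is a minimal sketch of policy-gradient ascent on a softmax bandit with binary (0/1) rewards. It is an assumption-laden toy, not the paper's experiment: the function name `run_binary_bandit`, the zero initialization, and the use of the exact expected gradient (rather than sampled responses) are all hypothetical choices, and the Gradient Gap itself is not computed. With exact gradients the success rate in this toy rises for any positive step size, so it does not reproduce the step-size threshold or collapse described above; it only shows how the success rate is tracked under a fixed learning rate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def success_rate(logits, rewards):
    """Probability that the softmax policy picks a reward-1 arm."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def run_binary_bandit(rewards, step_size, num_steps):
    """Exact-gradient ascent on the expected binary reward (toy setting).

    rewards   : list of 0/1 values, one per arm (1 = "verified correct").
    step_size : fixed learning rate (hypothetical parameter name).
    Returns the success rate before training and after every update.
    """
    logits = [0.0] * len(rewards)  # uniform initial policy (assumption)
    history = [success_rate(logits, rewards)]
    for _ in range(num_steps):
        probs = softmax(logits)
        j = sum(p * r for p, r in zip(probs, rewards))
        # Gradient of the expected reward w.r.t. logit a is pi_a * (r_a - J).
        logits = [t + step_size * p * (r - j)
                  for t, p, r in zip(logits, probs, rewards)]
        history.append(success_rate(logits, rewards))
    return history


if __name__ == "__main__":
    hist = run_binary_bandit([1, 0, 0, 0], step_size=0.5, num_steps=200)
    print(f"initial success rate: {hist[0]:.3f}")
    print(f"final success rate:   {hist[-1]:.3f}")
```

The initial success rate is 0.25 because one of four equally likely arms is rewarded. Replacing the exact gradient with sampled updates would be the natural next step toward the stochastic regime the paper analyzes.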

arxiv preprint arxiv, exp, inequality, (15 more...)

arXiv:2510.08539

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (0.50)
Instructional Material (0.46)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

From Handwriting to Feedback: Evaluating VLMs and LLMs for AI-Powered Assessment in Indonesian Classrooms

Aisyah, Nurul, Kautsar, Muhammad Dehan Al, Hidayat, Arif, Chowdhury, Raqib, Koto, Fajri

arXiv.org Artificial Intelligence | Oct-10-2025

Despite rapid progress in vision-language and large language models (VLMs and LLMs), their effectiveness for AI-driven educational assessment in real-world, underrepresented classrooms remains largely unexplored. We evaluate state-of-the-art VLMs and LLMs on over 14K handwritten answers from grade-4 classrooms in Indonesia, covering Mathematics and English aligned with the local national curriculum. Unlike prior work on clean digital text, our dataset features naturally curly, diverse handwriting from real classrooms, posing realistic visual and linguistic challenges. Assessment tasks include grading and generating personalized Indonesian feedback guided by rubric-based evaluation. Results show that the VLM struggles with handwriting recognition, causing error propagation in LLM grading, yet LLM feedback remains pedagogically useful despite imperfect visual inputs, revealing limits in personalization and contextual relevance.

large language model, machine learning, natural language, (19 more...)

arXiv:2506.04822

Country: Asia > Indonesia > Sumatra (0.14)

Genre:

Instructional Material > Online (0.34)
Instructional Material > Course Syllabus & Notes (0.34)
Research Report > New Finding (0.34)

Industry:

Education > Educational Setting (0.94)
Education > Curriculum > Subject-Specific Education (0.93)
Education > Assessment & Standards > Student Performance (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Bayesian Decision Making around Experts

Ornia, Daniel Jarne, Dyer, Joel, Bishop, Nicholas, Calinescu, Anisoara, Wooldridge, Michael

arXiv.org Artificial Intelligence | Oct-10-2025

Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how learners should optimally incorporate certain forms of expert data, which may differ in structure from the learner's own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert's optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience, or based on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner's posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes its one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner in cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv:2510.08113

Country: North America > United States (0.14)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Data Science > Data Mining > Big Data (0.66)
