Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Neural Information Processing Systems

We estimate the difficulty of individual problems by leveraging the performance data of many human subjects and LLMs on prominent leaderboards. Harnessing this rich performance data, we employ widely recognized difficulty ranking systems, including Item Response Theory (IRT) and Glicko-2 models, to uniformly assign difficulty scores to problems. The Easy2Hard datasets distinguish themselves from previous collections by incorporating a significantly higher proportion of challenging problems, presenting a novel and demanding test for state-of-the-art LLMs. Through extensive experiments with six state-of-the-art LLMs on the Easy2Hard datasets, we offer valuable insights into their performance and generalization capabilities across varying degrees of difficulty, setting the stage for future research on LLM generalization.
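
As a concrete illustration of the IRT-style scoring above, the minimal sketch below fits Rasch (1PL) difficulties from a binary solver-by-problem response matrix; the random data is a stand-in for real leaderboard outcomes, and the Glicko-2 side of the paper is not shown.

    import numpy as np

    # responses[i, j] = 1 if solver i answered problem j correctly, else 0
    # (random data here; real input would come from leaderboard logs)
    rng = np.random.default_rng(0)
    responses = rng.integers(0, 2, size=(100, 20)).astype(float)

    n_solvers, n_items = responses.shape
    ability = np.zeros(n_solvers)      # theta_i
    difficulty = np.zeros(n_items)     # b_j

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Gradient ascent on the Rasch (1PL) log-likelihood,
    # where P(correct) = sigmoid(theta_i - b_j).
    lr = 0.05
    for _ in range(500):
        p = sigmoid(ability[:, None] - difficulty[None, :])
        grad = responses - p                 # residuals drive both updates
        ability += lr * grad.sum(axis=1)
        difficulty -= lr * grad.sum(axis=0)
        difficulty -= difficulty.mean()      # center b_j for identifiability

    print(difficulty)  # higher b_j = harder problem, one standardized scale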


A Multi-Agent LLM Framework for Design Space Exploration in Autonomous Driving Systems

Shih, Po-An, Wang, Shao-Hua, Li, Yung-Che, Tu, Chia-Heng, Chang, Chih-Han

arXiv.org Artificial Intelligence

Designing autonomous driving systems requires efficient exploration of large hardware/software configuration spaces under diverse environmental conditions, e.g., with varying traffic, weather, and road layouts. Traditional design space exploration (DSE) approaches struggle with multi-modal execution outputs and complex performance trade-offs, and often require human involvement to assess correctness based on execution outputs. This paper presents a multi-agent, large language model (LLM)-based DSE framework, which integrates multi-modal reasoning with 3D simulation and profiling tools to automate the interpretation of execution outputs and guide the exploration of system designs. Specialized LLM agents are leveraged to handle user input interpretation, design point generation, execution orchestration, and analysis of both visual and textual execution outputs, which enables identification of potential bottlenecks without human intervention. A prototype implementation is developed and evaluated on a robotaxi case study (an SAE Level 4 autonomous driving application). Compared with a genetic algorithm baseline, the proposed framework identifies more Pareto-optimal, cost-efficient solutions with reduced navigation time under the same exploration budget. Experimental results also demonstrate the efficiency of adopting the LLM-based approach for DSE. We believe that this framework paves the way toward the design automation of autonomous driving systems.
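
A hypothetical skeleton of the agent loop the abstract describes: the agent functions below are stand-ins for LLM calls and evaluate() for the 3D simulation and profiling tools, so all names and signatures are illustrative rather than the paper's actual API.

    from dataclasses import dataclass

    @dataclass
    class DesignPoint:
        config: dict            # hardware/software knobs
        cost: float = 0.0
        nav_time: float = 0.0

    def interpret_user_input(spec_text):        # user-input interpretation agent
        return {"budget": 50, "scenario": "urban, rain"}

    def generate_design_points(goal, history):  # design-point generation agent
        return [DesignPoint({"cpu_cores": 4, "gpus": 1})]

    def analyze_outputs(frames, logs):          # multi-modal analysis agent
        return {"bottleneck": "perception latency"}

    def evaluate(point, scenario):              # simulator + profiler stand-in
        return [], "", 12.3, 100.0              # frames, logs, nav_time, cost

    def explore(spec_text):
        goal, history, pareto = interpret_user_input(spec_text), [], []
        for _ in range(goal["budget"]):
            for point in generate_design_points(goal, history):
                frames, logs, point.nav_time, point.cost = evaluate(point, goal["scenario"])
                history.append((point, analyze_outputs(frames, logs)))
                # simplified Pareto filter: keep a design no kept design dominates
                if not any(p.cost <= point.cost and p.nav_time <= point.nav_time
                           for p in pareto):
                    pareto.append(point)
        return pareto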


A Control-Theoretic Approach to Dynamic Payment Routing for Success Rate Optimization

Agrawal, Aniket, Patil, Harsharanga

arXiv.org Artificial Intelligence

This paper introduces a control-theoretic framework for dynamic payment routing, implemented within JUSPAY's Payment Orchestrator to maximize transaction success rate. The routing system is modeled as a closed-loop feedback controller that continuously senses gateway [3] performance, computes corrective actions, and dynamically routes transactions across gateways to ensure operational resilience. The system leverages concepts from control theory, reinforcement learning, and multi-armed bandit optimization to achieve both short-term responsiveness and long-term stability. Rather than relying on explicit PID regulation, the framework applies generalized feedback-based adaptation, ensuring that corrective actions remain proportional to observed performance deviations and that the computed gateway score gradually converges toward the success rate [2]. This hybrid approach unifies control theory and adaptive decision systems, enabling self-regulating transaction routing that dampens instability and improves reliability. Live production results show an improvement of up to 1.15% in success rate over traditional rule-based routing, demonstrating the effectiveness of feedback-based control in payment systems.
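
A minimal sketch (not JUSPAY's implementation) of the feedback idea: each gateway keeps a score that is nudged toward its observed success rate, with the correction proportional to the deviation, and traffic is then split by an epsilon-greedy bandit rule. The gain and exploration constants are assumptions.

    import random

    scores = {"gw_a": 0.90, "gw_b": 0.90}   # per-gateway success-rate estimates
    GAIN, EPSILON = 0.1, 0.05               # feedback gain, exploration rate

    def record_outcome(gateway, success):
        # Proportional correction: the score converges toward the
        # observed success rate as outcomes accumulate.
        error = (1.0 if success else 0.0) - scores[gateway]
        scores[gateway] += GAIN * error

    def pick_gateway():
        if random.random() < EPSILON:           # occasionally explore
            return random.choice(list(scores))
        return max(scores, key=scores.get)      # otherwise exploit best score

    gw = pick_gateway()
    record_outcome(gw, success=True)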


MAB Optimizer for Estimating Math Question Difficulty via Inverse CV without NLP

Das, Surajit, Roy, Gourav, Eliseev, Aleksei, Rajendran, Ram Kumar

arXiv.org Artificial Intelligence

The evolution of technology and education is driving the emergence of Intelligent & Autonomous Tutoring Systems (IATS), where objective and domain-agnostic methods for determining question difficulty are essential. Traditional human labeling is subjective, and existing NLP-based approaches fail in symbolic domains like algebra. This study introduces the Approach of Passive Measures among Educands (APME), a reinforcement learning-based Multi-Armed Bandit (MAB) framework that estimates difficulty solely from solver performance data -- marks obtained and time taken -- without requiring linguistic features or expert labels. By leveraging the inverse coefficient of variation as a risk-adjusted metric, the model provides an explainable and scalable mechanism for adaptive assessment. Empirical validation was conducted on three heterogeneous datasets. Across these diverse contexts, the model achieved an average R^2 of 0.9213 and an average RMSE of 0.0584, confirming its robustness, accuracy, and adaptability to different educational levels and assessment formats. Compared with baseline approaches, such as regression-based, NLP-driven, and IRT models, the proposed framework consistently outperformed alternatives, particularly in purely symbolic domains. The findings highlight that (i) item heterogeneity strongly influences perceived difficulty, and (ii) variance in solver outcomes is as critical as mean performance for adaptive allocation. Pedagogically, the model aligns with Vygotsky's Zone of Proximal Development by identifying tasks that balance challenge and attainability, supporting motivation while minimizing disengagement. This domain-agnostic, self-supervised approach advances difficulty tagging in IATS and can be extended beyond algebra wherever solver interaction data is available.
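
The risk-adjusted metric itself is easy to reproduce. The sketch below computes a per-question inverse coefficient of variation (mean over standard deviation) from marks and time taken; the time normalization is an assumption, and the paper's exact formulation may differ.

    import numpy as np

    # rows: solvers, columns: questions (toy data)
    marks = np.array([[0.9, 0.4], [0.8, 0.2], [1.0, 0.5]])
    time_taken = np.array([[30.0, 120.0], [40.0, 150.0], [25.0, 90.0]])

    # Assumed combination: marks per unit of relative time spent.
    performance = marks / (time_taken / time_taken.mean(axis=0))

    mean = performance.mean(axis=0)
    std = performance.std(axis=0) + 1e-9   # guard against zero variance
    inverse_cv = mean / std                # high = consistently easy; low = hard

    difficulty_rank = np.argsort(inverse_cv)   # hardest questions first
    print(inverse_cv, difficulty_rank)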


Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization

Dong, Ximing, Wang, Shaowei, Lin, Dayi, Hassan, Ahmed E.

arXiv.org Artificial Intelligence

Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge, but most of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection approach for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.
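
The two-stage idea can be sketched compactly: cluster sample embeddings and keep center plus boundary points, then iteratively swap out samples whose real-time performance profiles are highly correlated. The helper names, threshold, and KMeans choice below are illustrative, not IPOMP's exact procedure.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_initial(embeddings, k):
        # Stage 1: semantic clustering plus boundary analysis.
        km = KMeans(n_clusters=k, n_init=10).fit(embeddings)
        picks = []
        for c in range(k):
            idx = np.where(km.labels_ == c)[0]
            d = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
            picks += [idx[d.argmin()], idx[d.argmax()]]   # center + boundary point
        return list(dict.fromkeys(picks))                  # dedupe, keep order

    def refine(selected, pool, perf, corr_thresh=0.95):
        # Stage 2: perf[i] holds the per-prompt scores observed so far for
        # sample i; highly correlated pairs are treated as redundant.
        for i in list(selected):
            for j in selected:
                if i != j and pool and np.corrcoef(perf[i], perf[j])[0, 1] > corr_thresh:
                    selected.remove(i)              # drop the redundant sample
                    selected.append(pool.pop())     # bring in a fresh candidate
                    break
        return selected

    emb = np.random.rand(200, 32)
    subset = select_initial(emb, k=5)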


JanusDDG: A Thermodynamics-Compliant Model for Sequence-Based Protein Stability via Two-Fronts Multi-Head Attention

Barducci, Guido, Rossi, Ivan, Codicè, Francesco, Rollo, Cesare, Repetto, Valeria, Pancotti, Corrado, Iannibelli, Virginia, Sanavia, Tiziana, Fariselli, Piero

arXiv.org Artificial Intelligence

Understanding how residue variations affect protein stability is crucial for designing functional proteins and deciphering the molecular mechanisms underlying disease-related mutations. Recent advances in protein language models (PLMs) have revolutionized computational protein analysis, enabling, among other things, more accurate predictions of mutational effects. In this work, we introduce JanusDDG, a deep learning framework that leverages PLM-derived embeddings and a bidirectional cross-attention transformer architecture to predict $\Delta\Delta G$ of single and multiple-residue mutations while simultaneously being constrained to respect fundamental thermodynamic properties, such as antisymmetry and transitivity. Unlike conventional self-attention, JanusDDG computes queries (Q) and values (V) as the difference between wild-type and mutant embeddings, while keys (K) alternate between the two. This cross-interleaved attention mechanism enables the model to capture mutation-induced perturbations while preserving essential contextual information. Experimental results show that JanusDDG achieves state-of-the-art performance in predicting $\Delta\Delta G$ from sequence alone, matching or exceeding the accuracy of structure-based methods for both single and multiple mutations. Code availability: https://github.com/compbiomed-unito/JanusDDG
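
A schematic PyTorch rendering of the cross-interleaved attention described above, with Q and V computed from the wild-type minus mutant embedding difference and K alternating between the two sequences; the single-head form and the dimensions are simplifications for illustration, not the published architecture.

    import torch
    import torch.nn.functional as F

    def two_fronts_attention(wt, mut, wq, wk, wv):
        # wt, mut: (seq_len, d) PLM embeddings of wild-type and mutant
        diff = wt - mut                      # mutation-induced perturbation
        q, v = diff @ wq, diff @ wv          # Q and V from the difference
        outs = []
        for k_src in (wt, mut):              # front 1: K from wild-type; front 2: K from mutant
            k = k_src @ wk
            attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
            outs.append(attn @ v)
        return torch.cat(outs, dim=-1)       # concatenate the two fronts

    d = 64
    wt, mut = torch.randn(10, d), torch.randn(10, d)
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))
    print(two_fronts_attention(wt, mut, wq, wk, wv).shape)  # (10, 128)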


Generative Data Imputation for Sparse Learner Performance Data Using Generative Adversarial Imputation Networks

Zhang, Liang, Lin, Jionghao, Sabatini, John, Zapata-Rivera, Diego, Forsyth, Carol, Jiang, Yang, Hollander, John, Hu, Xiangen, Graesser, Arthur C.

arXiv.org Artificial Intelligence

Advancements in AI-driven technologies have significantly enhanced modern education through personalized tutoring and adaptive learning strategies on online platforms [1], [2]. Intelligent Tutoring Systems (ITSs) exemplify this progress by leveraging advanced machine learning and natural language processing models to create interactive learning environments that improve outcomes across domains like literacy [3], mathematics [4], language learning [5], biology [6], and other STEM fields [7]. As human learners interact with ITSs, often through question-and-answer scenarios with immediate responses, their performance data becomes crucial for learner modeling, enabling systems to track progress, predict future performance, and adapt instruction accordingly [8]. Learner models like Bayesian Knowledge Tracing (BKT) and other knowledge tracing variants utilize the learner performance data to uncover learning characteristics and estimate knowledge states and acquisition [9]. However, in real-world scenarios, missing learner performance data is prevalent due to factors such as learner dropout or disengagement [10], technical issues or incomplete data logging [11], biased sampling within experimental groups [12], and more. These challenges often lead to sparse data, where items (i.e., questions or problems) remain unattempted (e.g., learners may bypass the question, leave it unanswered due to a lack of response initiation, or make no attempt to engage with it), alongside limited learner interactions [13], [14]. As shown in Figure 1, missing performance records can occur along both the attempt and question dimensions during learner-ITS interactions; in the right portion of the figure's two matrices, entries marked with "?" denote missing records.
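
A minimal GAIN-style sketch in PyTorch (after Yoon et al.'s GAIN, which the title references): the generator fills in masked entries of a learner-performance matrix and the discriminator tries to tell observed cells from imputed ones. Layer sizes, the simplified hint vector, and the loss weighting are illustrative.

    import torch
    import torch.nn as nn

    d = 20                                     # questions per learner row
    G = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d), nn.Sigmoid())
    D = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d), nn.Sigmoid())
    opt_g, opt_d = (torch.optim.Adam(m.parameters(), 1e-3) for m in (G, D))

    x = torch.rand(32, d)                      # observed correctness values
    mask = (torch.rand(32, d) > 0.8).float()   # 1 = observed, 0 = missing (~80% sparse)
    hint = mask * (torch.rand(32, d) > 0.1).float()   # simplified hint vector

    for _ in range(100):
        noise = torch.rand(32, d)
        x_in = mask * x + (1 - mask) * noise
        imputed = G(torch.cat([x_in, mask], 1))
        x_hat = mask * x + (1 - mask) * imputed        # observed kept, gaps filled

        # Discriminator: predict which cells were actually observed.
        d_prob = D(torch.cat([x_hat.detach(), hint], 1))
        loss_d = nn.functional.binary_cross_entropy(d_prob, mask)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator: fool D on missing cells + reconstruct observed cells.
        d_prob = D(torch.cat([x_hat, hint], 1))
        loss_g = -torch.log(d_prob + 1e-8)[(1 - mask).bool()].mean() \
                 + 10 * ((mask * (imputed - x)) ** 2).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()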


Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI

Zhang, Liang, Lin, Jionghao, Sabatini, John, Borchers, Conrad, Weitekamp, Daniel, Cao, Meng, Hollander, John, Hu, Xiangen, Graesser, Arthur C.

arXiv.org Artificial Intelligence

Learning performance data describe correct and incorrect answers or problem-solving attempts in adaptive learning, such as in intelligent tutoring systems (ITSs). Learning performance data tend to be highly sparse (80% to 90% missing observations) in most real-world applications due to adaptive item selection. This data sparsity presents challenges to using learner models to effectively predict future performance and explore new hypotheses about learning. This article proposes a systematic framework for augmenting learner data to address data sparsity in learning performance data. First, learning performance is represented as a three-dimensional tensor of learners' questions, answers, and attempts, capturing longitudinal knowledge states during learning. Second, a tensor factorization method is used to impute missing values in sparse tensors of collected learner data, thereby grounding the imputation on knowledge tracing tasks that predict missing performance values based on real observations. Third, a module for generating simulated patterns of learning is used. This study contrasts two forms of generative Artificial Intelligence (AI), Generative Adversarial Networks (GANs) and Generative Pre-Trained Transformers (GPT), to generate data associated with different clusters of learner data. We tested this approach on an adult literacy dataset from AutoTutor lessons developed for Adult Reading Comprehension (ARC). We found that: (1) tensor factorization improved the performance in tracing and predicting knowledge mastery compared with other knowledge tracing techniques without data augmentation, showing higher relative fidelity for this imputation method, and (2) the GAN-based simulation showed greater overall stability and less statistical bias, based on a divergence evaluation with varying simulation sample sizes, compared to GPT.
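
The second step can be sketched as a CP-style factorization fit only on observed tensor entries, so that the reconstruction imputes the missing ones; the rank, learning rate, and random data below are illustrative, not the paper's settings.

    import numpy as np

    rng = np.random.default_rng(0)
    L, Q, A, R = 50, 20, 5, 4                 # learners, questions, attempts, rank
    tensor = rng.random((L, Q, A))            # stand-in for real performance data
    observed = rng.random((L, Q, A)) < 0.15   # ~85% missing, as in sparse ITS logs

    U, V, W = (0.1 * rng.standard_normal(s) for s in ((L, R), (Q, R), (A, R)))
    lr = 0.05
    for _ in range(300):
        recon = np.einsum('lr,qr,ar->lqa', U, V, W)
        err = np.where(observed, recon - tensor, 0.0)   # gradient only on observed cells
        U -= lr * np.einsum('lqa,qr,ar->lr', err, V, W)
        V -= lr * np.einsum('lqa,lr,ar->qr', err, U, W)
        W -= lr * np.einsum('lqa,lr,qr->ar', err, U, V)

    imputed = np.einsum('lr,qr,ar->lqa', U, V, W)       # predictions for missing cells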


Optimizing Fantasy Sports Team Selection with Deep Reinforcement Learning

Bhattacharjee, Shamik, Marathe, Kamlesh, Kapoor, Hitesh, Patil, Nilesh

arXiv.org Artificial Intelligence

Fantasy sports, particularly fantasy cricket, have garnered immense popularity in India in recent years, offering enthusiasts the opportunity to engage in strategic team-building and compete based on the real-world performance of professional athletes. In this paper, we address the challenge of optimizing fantasy cricket team selection using reinforcement learning (RL) techniques. By framing the team creation process as a sequential decision-making problem, we aim to develop a model that can adaptively select players to maximize the team's potential performance. Our approach leverages historical player data to train RL algorithms, which then predict future performance and optimize team composition. This not only represents a huge business opportunity by enabling more accurate predictions of high-performing teams but also enhances the overall user experience. Through empirical evaluation and comparison with traditional fantasy team drafting methods, we demonstrate the effectiveness of RL in constructing competitive fantasy teams. Our results show that RL-based strategies provide valuable insights into player selection in fantasy sports.
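
As a toy illustration of the sequential framing, the sketch below learns a per-slot value table with an epsilon-greedy update and then reads a squad off it greedily; the reward (historical fantasy points) and all sizes are stand-ins, not the paper's model.

    import numpy as np

    rng = np.random.default_rng(1)
    n_players, squad_size = 30, 11
    hist_points = rng.random(n_players) * 100   # stand-in for historical player data

    Q = np.zeros((squad_size, n_players))       # value of picking player p at slot s
    alpha, eps = 0.1, 0.2
    for _ in range(2000):                       # training episodes
        chosen = []
        for slot in range(squad_size):
            avail = [p for p in range(n_players) if p not in chosen]
            if rng.random() < eps:
                p = int(rng.choice(avail))                 # explore
            else:
                p = max(avail, key=lambda i: Q[slot, i])   # exploit
            reward = hist_points[p]                        # simplistic reward signal
            Q[slot, p] += alpha * (reward - Q[slot, p])
            chosen.append(p)

    # Greedy rollout: build the final team slot by slot from the learned table.
    team = []
    for slot in range(squad_size):
        avail = [p for p in range(n_players) if p not in team]
        team.append(max(avail, key=lambda p: Q[slot, p]))
    print(team)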