proficiency
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
Guo, Jiacheng, Huang, Suozhi, Yao, Zixin, Zhang, Yifan, Lu, Yifu, Liu, Jiashuo, Li, Zihao, Deng, Nicholas, Xiao, Qixin, Tian, Jia, Zhan, Kanghong, Li, Tianyi, Liu, Xiaochen, Ge, Jason, He, Chaoyang, Huang, Kaixuan, Yang, Lin, Huang, Wenhao, Wang, Mengdi
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: \emph{extreme time-sensitivity}, \emph{a highly adversarial information environment}, and the critical need to synthesize data from \emph{diverse, specialized sources}, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a \textit{retrieval-prediction imbalance}, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Michigan (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (2 more...)
A Definition of AGI
Hendrycks, Dan, Song, Dawn, Szegedy, Christian, Lee, Honglak, Gal, Yarin, Brynjolfsson, Erik, Li, Sharon, Zou, Andy, Levine, Lionel, Han, Bo, Fu, Jie, Liu, Ziwei, Shin, Jinwoo, Lee, Kimin, Mazeika, Mantas, Phan, Long, Ingebretsen, George, Khoja, Adam, Xie, Cihang, Salaudeen, Olawale, Hein, Matthias, Zhao, Kevin, Pan, Alexander, Duvenaud, David, Li, Bo, Omohundro, Steve, Alfour, Gabriel, Tegmark, Max, McGrew, Kevin, Marcus, Gary, Tallinn, Jaan, Schmidt, Eric, Bengio, Yoshua
The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains-including reasoning, memory, and perception-and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly "jagged" cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 57%) concretely quantify both rapid progress and the substantial gap remaining before AGI.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
- (22 more...)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- Education (1.00)
- (2 more...)
A Coherence-Based Measure of AGI
Recent approaches to evaluating Artificial General Intelligence (AGI) typically summarize a system's capability using the arithmetic mean of its proficiencies across multiple cognitive domains. While simple, this implicitly assumes compensability: exceptional performance in some areas can offset severe deficiencies in others. Genuine general intelligence, however, requires coherent sufficiency: balanced competence across all essential faculties. We introduce a coherence-based measure of AGI that integrates the generalized mean over a continuum of compensability exponents. This yields an area-under-the-curve (AUC) metric spanning arithmetic, geometric, and harmonic regimes, quantifying how robust an evaluated capability remains as compensability assumptions become stricter. Unlike the arithmetic mean, which rewards specialization, the AUC penalizes imbalance and exposes bottlenecks that constrain performance. To illustrate the framework, we apply it to cognitive profiles derived from the Cattell-Horn-Carroll (CHC) model, showing how coherence-based aggregation highlights imbalances that are obscured by arithmetic averaging. As a second, independent example, we apply the same methodology to a set of 17 heterogeneous benchmarks, demonstrating how coherence-based evaluation can reveal unevenness even in narrower task collections. These examples show that the proposed approach offers a principled, interpretable, and stricter foundation for measuring progress toward AGI.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Virginia (0.04)
- Europe > Estonia > Harju County > Tallinn (0.04)
A Model Architecture
In this section, we provide comprehensive details about the Transformer model architectures considered in this work. Experiments conducted on both DMLab and RoboMimic include RGB image observations. Detailed model parameters are listed in Table A.1. We learn 16-dim embedding vectors for all discrete actions. To encode proprioceptive measurement in RoboMimic, we follow Mandlekar et al.
MetaGDPO: Alleviating Catastrophic Forgetting with Metacognitive Knowledge through Group Direct Preference Optimization
Zhang, Lanxue, Xie, Yuqiang, Fang, Fang, Dong, Fanglong, Liu, Rui, Cao, Yanan
Large Language Models demonstrate strong reasoning capabilities, which can be effectively compressed into smaller models. However, existing datasets and fine-tuning approaches still face challenges that lead to catastrophic forgetting, particularly for models smaller than 8B. First, most datasets typically ignore the relationship between training data knowledge and the model's inherent abilities, making it difficult to preserve prior knowledge. Second, conventional training objectives often fail to constrain inherent knowledge preservation, which can result in forgetting of previously learned skills. To address these issues, we propose a comprehensive solution that alleviates catastrophic forgetting from both the data and fine-tuning approach perspectives. On the data side, we construct a dataset of 5K instances that covers multiple reasoning tasks and incorporates metacognitive knowledge, making it more tolerant and effective for distillation into smaller models. We annotate the metacognitive knowledge required to solve each question and filter the data based on task knowledge and the model's inherent skills. On the training side, we introduce GDPO (Group Direction Preference Optimization), which is better suited for resource-limited scenarios and can efficiently approximate the performance of GRPO. Guided by the large model and by implicitly constraining the optimization path through a reference model, GDPO enables more effective knowledge transfer from the large model and constrains excessive parameter drift. Extensive experiments demonstrate that our approach significantly alleviates catastrophic forgetting and improves reasoning performance on smaller models.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Beijing > Beijing (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (7 more...)
- Education (1.00)
- Information Technology > Security & Privacy (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels
Das, Sourya Dipta, Kumar, Shubham, Yadav, Kuldeep
Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > India (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Michigan (0.04)
Model Proficiency in Centralized Multi-Agent Systems: A Performance Study
Guerra, Anna, Guidi, Francesco, Closas, Pau, Dardari, Davide, Djuric, Petar M.
Autonomous agents are increasingly deployed in dynamic environments where their ability to perform a given task depends on both individual and team-level proficiency. While proficiency self-assessment (PSA) has been studied for single agents, its extension to a team of agents remains underexplored. This letter addresses this gap by presenting a framework for team PSA in centralized settings. We investigate three metrics for centralized team PSA: the measurement prediction bound (MPB), the Kolmogorov-Smirnov (KS) statistic, and the Kullback-Leibler (KL) divergence. These metrics quantify the discrepancy between predicted and actual measurements. We use the KL divergence as a reference metric since it compares the true and predictive distributions, whereas the MPB and KS provide efficient indicators for in situ assessment. Simulation results in a target tracking scenario demonstrate that both MPB and KS metrics accurately capture model mismatches, align with the KL divergence reference, and enable real-time proficiency assessment.
- North America > United States > New York > Suffolk County > Stony Brook (0.04)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features
Sung, Hakyung, Csuros, Karla, Sung, Min-Chang
This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). Findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading exhibits a more generative approach, extensively reworking vocabulary and sentence structures, such as employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.
- North America > United States > Oregon (0.40)
- Asia > Taiwan (0.04)
- Asia > Japan (0.04)
- (13 more...)