AITopics | gpt4-turbo 0

Collaborating Authors

gpt4-turbo 0

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance

Etzine, Bryan, Hashemi, Masoud, Madhusudhan, Nishanth, Davasam, Sagar, Sharma, Roshnee, Madhusudhan, Sathwik Tejaswi, Yadav, Vikas

arXiv.org Artificial IntelligenceMar-7-2025

Existing benchmarks are becoming saturated and struggle to separate model performances due to factors like data contamination and advancing LLM capabilities. This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric that revitalizes benchmarks by enhancing model separation. EMDM integrates final answer and Chain-of-Thought (CoT) reasoning correctness, assigning weights based on the complexity and reasoning depth required to solve a given sample in the evaluation data. Using a baseline LLM in two setups-Unguided, where the model has no prior exposure to test samples, and Guided, where the model has prior knowledge of the desired answer-EMDM distinguishes instances of varying difficulty. The CoT and answer correctness from these setups inform an optimization objective for weight assignment, resulting in a more nuanced evaluation of model performance. Compared to the exact match (EM) metric, which achieves 17% separation on ARC-Challenge, EMDM achieves 46%, demonstrating its effectiveness in differentiating models based on reasoning and knowledge requirements.

emdm, gpt4-turbo 0, reasoning, (16 more...)

arXiv.org Artificial Intelligence

2503.05551

Country:

North America > United States (0.15)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report (0.82)

Industry:

Law > Statutes (1.00)
Government (1.00)
Law > Civil Rights & Constitutional Law (0.96)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

CausalBench: A Comprehensive Benchmark for Causal Learning Capability of Large Language Models

Zhou, Yu, Wu, Xingyu, Huang, Beicheng, Wu, Jibin, Feng, Liang, Tan, Kay Chen

arXiv.org Artificial IntelligenceApr-9-2024

Causality reveals fundamental principles behind data distributions in real-world scenarios, and the capability of large language models (LLMs) to understand causality directly impacts their efficacy across explaining outputs, adapting to new evidence, and generating counterfactuals. With the proliferation of LLMs, the evaluation of this capacity is increasingly garnering attention. However, the absence of a comprehensive benchmark has rendered existing evaluation studies being straightforward, undiversified, and homogeneous. To address these challenges, this paper proposes a comprehensive benchmark, namely CausalBench, to evaluate the causality understanding capabilities of LLMs. Originating from the causal research community, CausalBench encompasses three causal learning-related tasks, which facilitate a convenient comparison of LLMs' performance with classic causal learning algorithms. Meanwhile, causal networks of varying scales and densities are integrated in CausalBench, to explore the upper limits of LLMs' capabilities across task scenarios of varying difficulty. Notably, background knowledge and structured data are also incorporated into CausalBench to thoroughly unlock the underlying potential of LLMs for long-text comprehension and prior information utilization. Based on CausalBench, this paper evaluates nineteen leading LLMs and unveils insightful conclusions in diverse aspects. Firstly, we present the strengths and weaknesses of LLMs and quantitatively explore the upper limits of their capabilities across various scenarios. Meanwhile, we further discern the adaptability and abilities of LLMs to specific structural networks and complex chain of thought structures. Moreover, this paper quantitatively presents the differences across diverse information sources and uncovers the gap between LLMs' capabilities in causal understanding within textual contexts and numerical domains.

dataset, llm, opt6d7b internlm7b internlm20b falcon7b gpt3, (11 more...)

arXiv.org Artificial Intelligence

2404.06349

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Hong Kong (0.04)
Asia > China > Chongqing Province > Chongqing (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.43)

Add feedback