Järviniemi, Olli
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Glazer, Elliot, Erdil, Ege, Besiroglu, Tamay, Chicharro, Diego, Chen, Evan, Gunning, Alex, Olsson, Caroline Falkman, Denain, Jean-Stanislas, Ho, Anson, Santos, Emily de Oliveira, Järviniemi, Olli, Barnett, Matthew, Sandler, Robert, Vrzala, Matej, Sevilla, Jaime, Ren, Qiuyu, Pratt, Elizabeth, Levine, Lionel, Barkley, Grant, Stewart, Natalie, Grechuk, Bogdan, Grechuk, Tetiana, Enugandla, Shreepranav Varma, Wildon, Mark
Recent AI systems have demonstrated remarkable proficiency in tackling challenging mathematical tasks, from achieving olympiad-level performance in geometry (Trinh et al. 2024) to improving upon existing research results in combinatorics (Romera-Paredes et al. 2024). However, existing benchmarks face key limitations:

Saturation of existing benchmarks. Current standard mathematics benchmarks such as the MATH dataset (Hendrycks, Burns, Kadavath, et al. 2021) and GSM8K (Cobbe et al. 2021) primarily assess competency at the high-school and early undergraduate level. As state-of-the-art models achieve near-perfect performance on these benchmarks, we lack rigorous ways to evaluate their capabilities in advanced mathematical domains that require deeper theoretical understanding, creative insight, and specialized expertise. Furthermore, to assess AI's potential contributions to mathematics research, we require benchmarks that better reflect the challenges faced by working mathematicians.

Benchmark contamination in training data. A significant challenge in evaluating large language models (LLMs) is data contamination: the inadvertent inclusion of benchmark problems in training data.
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Järviniemi, Olli, Hubinger, Evan
We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete; these tasks span writing assistance, information retrieval, and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care not to instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus 1) complies with a task of mass-generating comments to influence public perception of the company, and later deceives humans about having done so, 2) lies to auditors when asked questions, and 3) strategically pretends to be less capable than it is during capability evaluations. Our work demonstrates that even models trained to be helpful, harmless, and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so.