AITopics | Joo, Se June

Collaborating Authors

Joo, Se June

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Kim, Seungone, Suk, Juyoung, Cho, Ji Yong, Longpre, Shayne, Kim, Chaeeun, Yoon, Dongkeun, Son, Guijin, Cho, Yejin, Shafayat, Sheikh, Baek, Jinheon, Park, Sue Hyun, Hwang, Hyeonbin, Jo, Jinkyung, Cho, Hyowon, Shin, Haebin, Lee, Seongyun, Oh, Hanseok, Lee, Noah, Ho, Namgyu, Joo, Se June, Ko, Miyoung, Lee, Yoonjoo, Chae, Hyungjoo, Shin, Jamin, Jang, Joel, Ye, Seonghyeon, Lin, Bill Yuchen, Welleck, Sean, Neubig, Graham, Lee, Moontae, Lee, Kyungjae, Seo, Minjoon

arXiv.org Artificial IntelligenceJun-9-2024

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2406.05761

Country:

Europe (1.00)
North America > United States > Illinois (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (0.92)
Energy > Power Industry (0.67)
Energy > Renewable > Biofuel (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

Kim, Seungone, Joo, Se June, Kim, Doyoung, Jang, Joel, Ye, Seonghyeon, Shin, Jamin, Seo, Minjoon

arXiv.org Artificial IntelligenceOct-14-2023

Language models (LMs) with less than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning in contrast to large LMs when solving unseen tasks. In this work, we aim to equip smaller LMs with the step-by-step reasoning capability by instruction tuning with CoT rationales. In order to achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (including only 9 CoT tasks) with additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with CoT Collection enables smaller LMs to have better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B), in terms of zero-shot task accuracy. Furthermore, we show that instruction tuning with CoT Collection allows LMs to possess stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in an improvement of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT utilizing demonstrations until the max length by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2305.14045

Country: North America > United States (0.93)

Genre:

Research Report (0.82)
Personal > Interview (0.46)

Industry:

Leisure & Entertainment (0.68)
Law (0.68)
Education (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification

Kim, Seungone, Joo, Se June, Jang, Yul, Chae, Hyungjoo, Yeo, Jinyoung

arXiv.org Artificial IntelligenceMar-6-2023

Chain-of-thought (CoT) prompting enables large language models (LLMs) to solve complex reasoning tasks by generating an explanation before the final prediction. Despite it's promising ability, a critical downside of CoT prompting is that the performance is greatly affected by the factuality of the generated explanation. To improve the correctness of the explanations, fine-tuning language models with explanation data is needed. However, there exists only a few datasets that can be used for such approaches, and no data collection tool for building them. Thus, we introduce CoTEVer, a tool-kit for annotating the factual correctness of generated explanations and collecting revision data of wrong explanations. Figure 1: Example of Explanation Verification and Answer Furthermore, we suggest several use cases Verification of GPT-3's output. Explanation Verification where the data collected with CoTEVer can requires additional knowledge which makes it be utilized for enhancing the faithfulness of hard for annotators to intuitively write a revised explanation explanations. Our toolkit is publicly available and answer.

explanation, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2303.03628

Country: North America > United States (0.70)

Genre:

Workflow (0.46)
Research Report (0.40)

Industry:

Health & Medicine (0.68)
Automobiles & Trucks > Manufacturer (0.30)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback