AITopics

2401.14616

Country:

Asia > Middle East > UAE (0.15)
Europe > Germany (0.14)
Asia > Japan (0.14)

Genre: Research Report (0.64)

Industry: Law > Civil Rights & Constitutional Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Communications > Social Media (0.69)

arXiv.org Artificial Intelligence Dec-28-2023

SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling

Kim, Dahyun, Park, Chanjun, Kim, Sanghoon, Lee, Wonsung, Song, Wonho, Kim, Yunsu, Kim, Hyeonwoo, Kim, Yungi, Lee, Hyeonju, Kim, Jihoo, Ahn, Changbae, Yang, Seonghoon, Lee, Sukyung, Park, Hyunbyung, Gim, Gyoungjin, Cha, Mikyoung, Lee, Hwalsuk, Kim, Sunghun

We introduce SOLAR 10.7B, a large language model (LLM) with 10.7 billion parameters, demonstrating superior performance in various natural language processing (NLP) tasks. Inspired by recent efforts to efficiently up-scale LLMs, we present a method for scaling LLMs called depth up-scaling (DUS), which encompasses depthwise scaling and continued pretraining. In contrast to other LLM up-scaling methods that use mixture-of-experts, DUS does not require complex changes for efficient training and inference. We show experimentally that DUS is simple yet effective in scaling up high-performance LLMs from small ones. Building on the DUS model, we additionally present SOLAR 10.7B-Instruct, a variant fine-tuned for instruction-following capabilities, surpassing Mixtral-8x7B-Instruct. SOLAR 10.7B is publicly available under the Apache 2.0 license, promoting broad access and application in the LLM field.
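The abstract does not spell out how depthwise scaling works. As a rough illustration, the sketch below assumes the layer-duplication scheme described in the SOLAR paper itself: copy the base model, drop the final m layers from the original and the first m layers from the copy, then concatenate the two stacks. The function name `depthwise_scale` and the string stand-ins for transformer layers are hypothetical, and the continued pretraining step is not shown.

```python
from copy import deepcopy


def depthwise_scale(layers, m):
    """Return a deeper layer stack built from two overlapping copies.

    Assumed scheme: keep the first n - m layers of the original, then
    append a copy of the last n - m layers, giving 2 * (n - m) layers.
    """
    n = len(layers)
    if not 0 <= m <= n:
        raise ValueError("m must be between 0 and the number of layers")
    top = layers[: n - m]              # original without its final m layers
    bottom = deepcopy(layers[m:])      # duplicate without its first m layers
    return top + bottom


# Stand-in for a 32-layer base model; real layers would be network modules.
base = [f"layer_{i}" for i in range(32)]
scaled = depthwise_scale(base, 8)
print(len(scaled))
```

With a 32-layer base and m = 8, the resulting stack has 48 layers, and the seam falls between the original layer 23 and the copied layer 8.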

large language model, machine learning, natural language, (20 more...)

2312.15166

Country: Asia (0.14)

Genre: Research Report (0.50)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence Jun-27-2023

Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation

Lee, Seugnjun, Moon, Hyeonseok, Park, Chanjun, Lim, Heuiseok

In this paper, we introduce a data-driven approach for Formality-Sensitive Machine Translation (FSMT) that caters to the unique linguistic properties of four target languages. Our methodology centers on two core strategies: 1) language-specific data handling, and 2) synthetic data generation using large-scale language models and empirical prompt engineering. This approach demonstrates a considerable improvement over the baseline, highlighting the effectiveness of data-centric techniques. Our prompt engineering strategy further improves performance by producing superior synthetic translation examples.

artificial intelligence, natural language, translation, (13 more...)

2306.14514

Country: North America > United States > Hawaii (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

arXiv.org Artificial Intelligence Jun-26-2023

Knowledge Graph-Augmented Korean Generative Commonsense Reasoning

Jung, Dahyun, Seo, Jaehyung, Lee, Jaewook, Park, Chanjun, Lim, Heuiseok

Generative commonsense reasoning refers to the task of generating acceptable and logical assumptions about everyday situations based on commonsense understanding. By utilizing an existing dataset such as Korean CommonGen, language generation models can learn commonsense reasoning specific to the Korean language. However, language models often fail to consider the relationships between concepts and the deep knowledge inherent to concepts. To address these limitations, we propose a method that utilizes Korean knowledge graph data for text generation. Our experimental results show that the proposed method can enhance the efficiency of Korean commonsense inference, thereby underlining the significance of employing supplementary data.

artificial intelligence, commonsense reasoning, natural language, (12 more...)

2306.14470

Country:

North America > United States > Pennsylvania (0.15)
North America > United States > Hawaii (0.15)

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.69)

Transcending Traditional Boundaries: Leveraging Inter-Annotator Agreement (IAA) for Enhancing Data Management Operations (DMOps)

Kim, Damrin, Kim, NamHyeok, Park, Chanjun, Kim, Harksoo

This paper presents a novel approach of leveraging Inter-Annotator Agreement (IAA), traditionally used for assessing labeling consistency, to optimize Data Management Operations (DMOps). We advocate for the use of IAA in predicting the labeling quality of individual annotators, leading to cost and time efficiency in data production. Additionally, our work highlights the potential of IAA in forecasting document difficulty, thereby boosting the data construction process's overall efficiency. This research underscores IAA's broader application potential in data-driven research optimization and holds significant implications for large-scale data projects prioritizing efficiency, cost reduction, and high-quality data.

data mining, iaa, machine learning, (15 more...)

2306.14374

Country: North America > United States > Hawaii (0.15)

Genre:

Research Report (1.00)
Overview > Innovation (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.48)
Information Technology > Data Science > Data Mining (0.30)

Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

Park, Chanjun, Koo, Seonmin, Lee, Seolhwa, Seo, Jaehyung, Eo, Sugyeong, Moon, Hyeonseok, Lim, Heuiseok

The data-centric AI approach aims to enhance model performance without modifying the model and has been shown to impact model performance positively. While recent attention has been given to data-centric AI based on synthetic data, due to its potential for performance improvement, data-centric AI has long been exclusively validated using real-world data and publicly available benchmark datasets. In this respect, data-centric AI still depends heavily on real-world data, and the verification of models using synthetic data has not yet been thoroughly carried out. Given the challenges above, we ask the question: Does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data? To address this question, we conducted comparative analyses between models trained on synthetic and real-world data on the grammatical error correction (GEC) task. Our experimental results reveal that the data quality control method has a positive impact on models trained with real-world data, as previously reported in existing studies, while a negative impact is observed in models trained solely on synthetic data.

data quality, machine learning, natural language, (14 more...)

2306.14377

Country: North America > United States > Hawaii (0.14)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Data Science > Data Quality > Data Cleaning (0.72)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Inter-Annotator Agreement in the Wild: Uncovering Its Emerging Roles and Considerations in Real-World Scenarios

Kim, NamHyeok, Park, Chanjun

Inter-Annotator Agreement (IAA) is commonly used as a measure of label consistency in natural language processing tasks. However, in real-world scenarios, IAA has various roles and implications beyond its traditional usage. In this paper, we not only consider IAA as a measure of consistency but also as a versatile tool that can be effectively utilized in practical applications. Moreover, we discuss various considerations and potential concerns when applying IAA and suggest strategies for effectively navigating these challenges.
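The abstract does not name a specific IAA metric. As one common choice for two annotators, the sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The function name `cohens_kappa` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example: the annotators disagree on one of four items.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the observed agreement is 0.75 and the chance agreement is 0.5, so kappa comes out at 0.5. For more than two annotators, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual substitute.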

annotator, artificial intelligence, natural language, (13 more...)

2306.14373

Country: North America > United States > Hawaii (0.15)

Genre: Research Report (0.51)

Industry: Health & Medicine (0.30)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

DMOps: Data Management Operation and Recipes

Choi, Eujeong, Park, Chanjun

Data-centric AI has shed light on the significance of data within the machine learning (ML) pipeline. Recognizing its significance, academia, industry, and government departments have suggested various NLP data research initiatives. While the ability to utilize existing data is essential, the ability to build a dataset has become more critical than ever, especially in industry. In consideration of this trend, we propose "Data Management Operations and Recipes" to guide the industry in optimizing the building of datasets for NLP products. This paper presents the concept of DMOps, which is derived from real-world experience with NLP data management and aims to streamline data operations by offering a baseline.

data quality, machine learning, natural language, (12 more...)

2301.01228

Country: North America > United States > Hawaii (0.14)

Genre: Workflow (0.93)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.49)

arXiv.org Artificial Intelligence Mar-20-2023

Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural Language Processing Leaderboards

Park, Chanjun, Moon, Hyeonseok, Lee, Seolhwa, Seo, Jaehyung, Eo, Sugyeong, Lim, Heuiseok

Leaderboard systems allow researchers to objectively evaluate Natural Language Processing (NLP) models and are typically used to identify models that exhibit superior performance on a given task in a predetermined setting. However, we argue that evaluation on a given test dataset is just one of many performance indications of the model. In this paper, we claim leaderboard competitions should also aim to identify models that exhibit the best performance in a real-world setting. We highlight three issues with current leaderboard systems: (1) the use of a single, static test set, (2) the discrepancy between testing and real-world application, and (3) the tendency for leaderboard-centric competition to be biased towards the test set. As a solution, we propose a new paradigm of leaderboard systems that addresses these issues. Through this study, we hope to induce a paradigm shift towards more real-world-centric leaderboard competitions.

artificial intelligence, leaderboard, natural language, (15 more...)

2303.10888

Country: Europe (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

arXiv.org Artificial Intelligence Nov-29-2022

QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation

Eo, Sugyeong, Park, Chanjun, Moon, Hyeonseok, Seo, Jaehyung, Kim, Gyeongmin, Lee, Jungseob, Lim, Heuiseok

With the recent advance in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, there remain several limitations concerning manual QE data creation: inevitably incurred non-trivial costs due to the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. This consists of three sub-QUAK datasets QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P, H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements by observing meaningful performance gains in QUAK-M, P when adding data up to 1.58M.

artificial intelligence, natural language, quak-m, (13 more...)

2209.15285

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)