Lee, Hanwool
(G)I-DLE: Generative Inference via Distribution-preserving Logit Exclusion with KL Divergence Minimization for Constrained Decoding
Lee, Hanwool
We propose (G)I-DLE, a new approach to constrained decoding that leverages KL divergence minimization to preserve the intrinsic conditional probability distribution of autoregressive language models while excluding undesirable tokens. Unlike conventional methods that naively set banned tokens' logits to $-\infty$, which can distort the conversion from raw logits to posterior probabilities and increase output variance, (G)I-DLE re-normalizes the allowed token probabilities to minimize such distortion. We validate our method on the K2-Eval dataset, which is specifically designed to assess Korean language fluency, logical reasoning, and cultural appropriateness. Experimental results on Qwen2.5 models (ranging from 1.5B to 14B) demonstrate that (G)I-DLE not only boosts mean evaluation scores but also substantially reduces the variance of output quality.
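The abstract does not spell out the exact procedure, so the following is only a minimal sketch of one reading of "re-normalizes the allowed token probabilities", not the paper's implementation. It relies on a standard fact: among distributions $q$ supported only on the allowed tokens, the one that minimizes $\mathrm{KL}(q \,\|\, p)$ is $p$ restricted to the allowed set and rescaled to sum to 1, and the resulting divergence equals $-\log$ of the allowed probability mass. The function names and the toy logits are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize_allowed(logits, banned):
    """Return the model distribution restricted to non-banned token ids.

    Hypothetical helper: banned tokens get probability 0 and the remaining
    probabilities are rescaled so that they sum to 1.
    """
    p = softmax(logits)
    allowed_mass = sum(pi for i, pi in enumerate(p) if i not in banned)
    if allowed_mass <= 0.0:
        raise ValueError("every token is banned")
    return [0.0 if i in banned else pi / allowed_mass for i, pi in enumerate(p)]


def kl_divergence(q, p):
    """KL(q || p), skipping terms where q is zero."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


# Toy example (assumed values): four tokens, token id 1 is banned.
logits = [2.0, 1.0, 0.5, -1.0]
banned = {1}
p = softmax(logits)
q = renormalize_allowed(logits, banned)
allowed_mass = sum(pi for i, pi in enumerate(p) if i not in banned)
print("KL(q || p):", kl_divergence(q, p))
print("-log(allowed mass):", -math.log(allowed_mass))
```

In this sketch the relative odds between any two allowed tokens are unchanged, which is one sense in which the original conditional distribution is preserved.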
TWICE: What Advantages Can Low-Resource Domain-Specific Embedding Model Bring? - A Case Study on Korea Financial Texts
Hwang, Yewon, Jung, Sungbum, Lee, Hanwool, Yu, Sara
Domain specificity of embedding models is critical for effective performance. However, existing benchmarks, such as FinMTEB, are primarily designed for high-resource languages, leaving low-resource settings, such as Korean, under-explored. Directly translating established English benchmarks often fails to capture the linguistic and cultural nuances present in low-resource domains. In this paper, we introduce KorFinMTEB, a novel benchmark for the Korean financial domain, specifically tailored to reflect the unique linguistic and cultural characteristics of a low-resource language. Our experimental results reveal that while the models perform robustly on a translated version of FinMTEB, their performance on KorFinMTEB uncovers subtle yet critical discrepancies, especially in tasks requiring deeper semantic understanding, that underscore the limitations of direct translation. These discrepancies highlight the necessity of benchmarks that incorporate language-specific idiosyncrasies and cultural nuances. The insights from our study advocate for the development of domain-specific evaluation frameworks that can more accurately assess and drive the progress of embedding models in low-resource settings.
ML-Promise: A Multilingual Dataset for Corporate Promise Verification
Seki, Yohei, Shu, Hakusen, Lhuissier, Anaïs, Lee, Hanwool, Kang, Juyeon, Day, Min-Yuh, Chen, Chung-Chi
Promises made by politicians, corporate leaders, and public figures have a significant impact on public perception, trust, and institutional reputation. However, the complexity and volume of such commitments, coupled with difficulties in verifying their fulfillment, necessitate innovative methods for assessing their credibility. This paper introduces the concept of Promise Verification, a systematic approach involving steps such as promise identification, evidence assessment, and the evaluation of timing for verification. We propose the first multilingual dataset, ML-Promise, which includes English, French, Chinese, Japanese, and Korean, aimed at facilitating in-depth verification of promises, particularly in the context of Environmental, Social, and Governance (ESG) reports. Given the growing emphasis on corporate environmental contributions, this dataset addresses the challenge of evaluating corporate promises, especially in light of practices like greenwashing. Our findings also explore textual and image-based baselines, with promising results from retrieval-augmented generation (RAG) approaches. This work aims to foster further discourse on the accountability of public commitments across multiple languages and domains.
KMMLU: Measuring Massive Multitask Language Understanding in Korean
Son, Guijin, Lee, Hanwool, Kim, Sungdong, Kim, Seungone, Muennighoff, Niklas, Choi, Taekyoon, Park, Cheonbok, Yoo, Kang Min, Biderman, Stella
We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language. We test 27 public and proprietary LLMs and observe that the best public model scores 50.5%, leaving significant room for improvement. This model was primarily trained for English and Chinese, not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X, do not exceed 60%. This suggests that further work is needed to improve LLMs for Korean, and we believe KMMLU offers the appropriate tool to track this progress. We make our dataset publicly available on the Hugging Face Hub and integrate the benchmark into EleutherAI's Language Model Evaluation Harness.
HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models
Son, Guijin, Lee, Hanwool, Kim, Suwan, Kim, Huiseo, Lee, Jaecheol, Yeom, Je Won, Jung, Jihyu, Kim, Jung Woo, Kim, Songseong
Large Language Models (LLMs) trained on massive corpora demonstrate impressive capabilities in a wide range of tasks. While there are ongoing efforts to adapt these models to languages beyond English, the attention given to their evaluation methodologies remains limited. Current multilingual benchmarks often rely on back translations or re-implementations of English tests, limiting their capacity to capture unique cultural and linguistic nuances. To bridge this gap for the Korean language, we introduce HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth. The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension. In contrast to traditional evaluation suites focused on token or sequence classification and specific mathematical or logical reasoning, HAE-RAE Bench emphasizes a model's aptitude for recalling Korean-specific knowledge and cultural contexts. Comparative analysis with prior Korean benchmarks indicates that HAE-RAE Bench presents a greater challenge to non-native models by hindering the transfer of abilities and knowledge learned from English.
EaSyGuide : ESG Issue Identification Framework leveraging Abilities of Generative Large Language Models
Lee, Hanwool, Choi, Jonghyun, Kwon, Sohyeon, Jung, Sungbum
This paper presents our participation in the FinNLP-2023 shared task on multi-lingual environmental, social, and corporate governance issue identification (ML-ESG). The task's objective is to classify news articles based on the 35 ESG key issues defined by the MSCI ESG rating guidelines. Our approach focuses on the English and French subtasks, employing the CerebrasGPT, OPT, and Pythia models, along with the zero-shot and GPT3Mix Augmentation techniques. We utilize various encoder models, such as RoBERTa, DeBERTa, and FinBERT, subjecting them to knowledge distillation and additional training. Our approach yielded exceptional results, securing the first position in the English text subtask with an F1-score of 0.69 and the second position in the French text subtask with an F1-score of 0.78. These outcomes underscore the effectiveness of our methodology in identifying ESG issues in news articles across different languages. Our findings contribute to the exploration of ESG topics and highlight the potential of leveraging advanced language models for ESG issue identification.
Removing Non-Stationary Knowledge From Pre-Trained Language Models for Entity-Level Sentiment Classification in Finance
Son, Guijin, Lee, Hanwool, Kang, Nahyeon, Hahm, Moonjeong
Extraction of sentiment signals from news text, stock message boards, and business reports for stock movement prediction has been a rising field of interest in finance. Building upon past literature, the most recent works attempt to better capture sentiment from sentences with complex syntactic structures by introducing aspect-level sentiment classification (ASC). Despite the growing interest, however, fine-grained sentiment analysis has not been fully explored in non-English literature due to the shortage of annotated finance-specific data. Accordingly, it is necessary for non-English languages to leverage datasets and pre-trained language models (PLMs) of different domains, languages, and tasks to improve their performance. To facilitate finance-specific ASC research in the Korean language, we build KorFinASC, a Korean aspect-level sentiment classification dataset for finance consisting of 12,613 human-annotated samples, and explore methods of intermediate transfer learning. Our experiments indicate that past research has overlooked the potentially wrong knowledge of financial entities encoded during the training phase, which has led to overestimating the predictive power of PLMs. In our work, we use the term "non-stationary knowledge" to refer to information that was previously correct but is likely to change, and present "TGT-Masking", a novel masking pattern to restrict PLMs from speculating on knowledge of this kind. Finally, through a series of transfer learning steps with TGT-Masking applied, we improve classification accuracy by 22.63% compared to standalone models on KorFinASC.
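The abstract does not define the TGT-Masking pattern in detail. As a rough illustration of the general idea of entity masking, the sketch below replaces mentions of the target entity with a placeholder token before the text reaches the classifier, so that the model cannot lean on memorized facts about that entity. The `[TGT]` token, the `mask_target` helper, and the example sentence are all assumptions made for illustration, not details taken from the paper.

```python
import re


def mask_target(sentence, target, mask_token="[TGT]"):
    """Replace every literal occurrence of `target` with `mask_token`.

    Hypothetical helper: the real masking pattern may differ (for example,
    it may operate on tokens rather than on raw strings).
    """
    pattern = re.compile(re.escape(target))
    return pattern.sub(mask_token, sentence)


# Assumed example: classify sentiment toward "Samsung" without exposing the name.
masked = mask_target("Samsung shares rose while LG fell.", "Samsung")
print(masked)
```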