AITopics

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Europe > Bulgaria (0.04)
(4 more...)

Genre: Overview (0.45)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Neural Information Processing SystemsFeb-14-2026, 12:09:05 GMT

5c1019b5711474ae5627dc8580614e01-Paper-Datasets_and_Benchmarks_Track.pdf

benchmark, large language model, machine learning, (20 more...)

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Europe > Bulgaria (0.04)
(9 more...)

Genre:

Research Report (0.67)
Overview (0.45)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Data Science (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
(2 more...)

arXiv.org Artificial IntelligenceOct-29-2025

Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

Song, Seyoung, Kim, Nawon, Chae, Songeun, Park, Kiwoong, Jin, Jiho, Yoo, Haneul, Cho, Kyunghyun, Oh, Alice

The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea ' s lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.

large language model, machine learning, natural language, (17 more...)

2510.24541

Country:

Europe (1.00)
Asia > North Korea (1.00)

Genre: Research Report (0.64)

Industry: Media > News (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.34)

Neural Information Processing SystemsOct-10-2025, 03:43:45 GMT

5c1019b5711474ae5627dc8580614e01-Supplemental-Datasets_and_Benchmarks_Track.pdf

collection requirement, qwen1, template explanation, (14 more...)

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Asia > China > Yunnan Province > Kunming (0.04)
(5 more...)

Genre: Overview (0.45)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Neural Information Processing SystemsOct-10-2025, 03:43:42 GMT

WenMind: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Classical Literature and Language Arts

Large Language Models (LLMs) have made significant advancements across numerous domains, but their capabilities in Chinese Classical Literature and Language Arts (CCLLA) remain largely unexplored due to the limited scope and tasks of existing benchmarks. To fill this gap, we propose WenMind, a comprehensive benchmark dedicated for evaluating LLMs in CCLLA.

benchmark, llm, qwen1, (14 more...)

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Europe > Bulgaria (0.04)
(9 more...)

Genre:

Research Report (0.67)
Overview (0.45)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

arXiv.org Artificial IntelligenceJun-19-2025

Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model

Li, Shen, Hu, Renfen, Wang, Lijun

General-purpose large language models demonstrate notable capabilities in language comprehension and generation, achieving results that are comparable to, or even surpass, human performance in many natural language processing tasks. Nevertheless, when general models are applied to some specific domains, e.g., Classical Chinese texts, their effectiveness is often unsatisfactory, and fine-tuning open-source foundational models similarly struggles to adequately incorporate domain-specific knowledge. To address this challenge, this study developed a large language model, AI Taiyan, specifically designed for understanding and generating Classical Chinese. Experiments show that with a reasonable model design, data processing, foundational training, and fine-tuning, satisfactory results can be achieved with only 1.8 billion parameters. In key tasks related to language processing of Classical Chinese such as punctuation, identification of allusions, explanation of word meanings, and translation between ancient and modern Chinese, this model exhibits a clear advantage over both general-purpose large models and domain-specific traditional models, achieving levels close to or surpassing human baselines. This research provides a reference for the efficient construction of specialized domain-specific large language models. Furthermore, the paper discusses the application of this model in fields such as the collation of ancient texts, dictionary editing, and language research, combined with case studies.

large language model, machine learning, natural language, (18 more...)

2505.1181

Genre: Research Report > New Finding (0.93)

Industry:

Education > Educational Setting (0.47)
Government (0.46)
Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

arXiv.org Artificial IntelligenceApr-30-2025

WenyanGPT: A Large Language Model for Classical Chinese Tasks

Yao, Xinyu, Wang, Mengdi, Chen, Bo, Zhao, Xiaobing

Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models primarily optimize for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. By continuing pre-training and instruction fine-tuning on the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs in various Classical Chinese tasks. We make the model's training data, instruction fine-tuning data\footnote, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.

large language model, machine learning, natural language, (20 more...)

2504.20609

Country:

Asia > China (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceMar-20-2025

F\`ux\`i: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

Zhao, Shangqing, Zhou, Yuhao, Ren, Yupei, Chen, Zhe, Jia, Chenghao, Zhe, Fang, Long, Zhaogaung, Liu, Shu, Lan, Man

Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce F\`ux\`i, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.

large language model, machine learning, natural language, (22 more...)

2503.15837

Country:

Europe > Switzerland (0.04)
Asia > Singapore (0.04)
North America > United States > New York > New York County > New York City (0.04)
(6 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Education (1.00)
Health & Medicine > Consumer Health (0.46)
Health & Medicine > Therapeutic Area > Endocrinology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

arXiv.org Artificial IntelligenceNov-7-2024

When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun

Song, Seyoung, Yoo, Haneul, Jin, Jiho, Cho, Kyunghyun, Oh, Alice

Historical and linguistic connections within the Sinosphere have led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within $\pm{}0.0068$ F1-score for sequence labeling tasks and up to $+0.84$ BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These mixed results emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.

classical chinese, computational linguistic, translation, (15 more...)

2411.04822

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Vietnam (0.04)
North America > Canada > Ontario > Toronto (0.04)
(12 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceJul-4-2024

TongGu: Mastering Classical Chinese Understanding with Knowledge-Grounded Large Language Models

Cao, Jiahuan, Peng, Dezhi, Zhang, Peirong, Shi, Yongxin, Liu, Yang, Ding, Kai, Jin, Lianwen

Classical Chinese is a gateway to the rich heritage and wisdom of ancient China, yet its complexities pose formidable comprehension barriers for most modern people without specialized knowledge. While Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), they struggle with Classical Chinese Understanding (CCU), especially in data-demanding and knowledge-intensive tasks. In response to this dilemma, we propose \textbf{TongGu} (mean understanding ancient and modern), the first CCU-specific LLM, underpinned by three core contributions. First, we construct a two-stage instruction-tuning dataset ACCN-INS derived from rich classical Chinese corpora, aiming to unlock the full CCU potential of LLMs. Second, we propose Redundancy-Aware Tuning (RAT) to prevent catastrophic forgetting, enabling TongGu to acquire new capabilities while preserving its foundational knowledge. Third, we present a CCU Retrieval-Augmented Generation (CCU-RAG) technique to reduce hallucinations based on knowledge-grounding. Extensive experiments across 24 diverse CCU tasks validate TongGu's superior ability, underscoring the effectiveness of RAT and CCU-RAG. The model and dataset will be public available.

classical chinese, dataset, tonggu, (15 more...)

2407.03937

Country:

Asia > China (0.24)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)