AITopics | gecturk


gecturk


Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs

Ersoy, Asım, Yıldız, Olcay Taner

arXiv.org Artificial Intelligence, May-24-2024

Grammatical Error Correction has seen significant progress with recent advances in deep learning. Because these methods require huge amounts of data, synthetic datasets are being built to fill the gap. Unfortunately, synthetic datasets are not organic enough in some cases and even require clean data to start with. Furthermore, most existing work focuses on English. In this work, we introduce a new organic data-driven approach, clean insertions, to build parallel Turkish Grammatical Error Correction datasets from any organic data, and to clean the data used for training Large Language Models. We achieve state-of-the-art results on two of the three publicly available Turkish Grammatical Error Correction test sets. We also show the effectiveness of our method on the training losses of language models.

correction, dataset, evaluation, (12 more...)

arXiv.org Artificial Intelligence

2405.1532

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)

Genre:

Overview (0.93)
Research Report (0.82)

Industry:

Media > Film (0.49)
Leisure & Entertainment (0.49)

Technology:

Information Technology > Data Science > Data Quality > Data Cleaning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
(2 more...)


GECTurk: Grammatical Error Correction and Detection Dataset for Turkish

Kara, Atakan, Sofian, Farrin Marouf, Bond, Andrew, Şahin, Gözde Gül

arXiv.org Artificial Intelligence, Sep-20-2023

Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
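To make the idea of rule-based synthetic data generation concrete, here is a minimal sketch of how a single writing rule could be turned into a transformation function that corrupts clean text and yields parallel (erroneous, correct) pairs. It is not the GECTurk pipeline: the function names `corrupt_de_da` and `make_parallel_pairs` are hypothetical, and the rule shown (the Turkish conjunction "de/da" is written as a separate word, and a common mistake is to attach it to the preceding word) is chosen only for illustration. The sketch assumes whitespace tokenization and ignores the phonological, morphological, and syntactic information the abstract says real rules require.

```python
from typing import Callable, List, Optional, Tuple

# Illustrative assumption: a standalone "de"/"da" token is the conjunction,
# which is written separately from the preceding word.
CLITICS = {"de", "da"}

Rule = Callable[[List[str]], Optional[List[str]]]


def corrupt_de_da(tokens: List[str]) -> Optional[List[str]]:
    """Attach the first standalone de/da to the previous token.

    Returns the corrupted token list, or None when the rule does not apply.
    """
    for i in range(1, len(tokens)):
        if tokens[i].lower() in CLITICS:
            joined = tokens[i - 1] + tokens[i].lower()
            return tokens[: i - 1] + [joined] + tokens[i + 1 :]
    return None


def make_parallel_pairs(
    sentences: List[str], rules: List[Rule]
) -> List[Tuple[str, str]]:
    """Apply each rule to each clean sentence; keep (erroneous, correct) pairs."""
    pairs = []
    for sentence in sentences:
        tokens = sentence.split()
        for rule in rules:
            corrupted = rule(tokens)
            if corrupted is not None:
                pairs.append((" ".join(corrupted), sentence))
    return pairs


if __name__ == "__main__":
    clean = ["Ben de geldim .", "Hava güzel ."]
    for wrong, right in make_parallel_pairs(clean, [corrupt_de_da]):
        print(f"{wrong!r} -> {right!r}")
```

Each rule is a plain function from a token list to a corrupted token list, so further rules can be added to the `rules` list without touching the pairing logic. A real rule would need linguistic analysis to decide whether a given "de/da" is actually the conjunction.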

error correction and detection dataset, gecturk, grammatical error correction

arXiv.org Artificial Intelligence

2309.11346

Genre: Research Report > New Finding (0.53)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.87)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.60)
