AITopics | native language identification

Robust Native Language Identification through Agentic Decomposition

Uluslu, Ahmet Yavuz, Kew, Tannon, Ellendorff, Tilia, Schneider, Gerold, Sennrich, Rico

arXiv.org Artificial Intelligence, Sep-23-2025

Large language models (LLMs) often achieve high performance in native language identification (NLI) benchmarks by leveraging superficial contextual clues such as names, locations, and cultural stereotypes, rather than the underlying linguistic patterns indicative of native language (L1) influence. To improve robustness, previous work has instructed LLMs to disregard such clues. In this work, we demonstrate that such a strategy is unreliable and model predictions can be easily altered by misleading hints. To address this problem, we introduce an agentic NLI pipeline inspired by forensic linguistics, where specialized agents accumulate and categorize diverse linguistic evidence before an independent final overall assessment. In this final assessment, a goal-aware coordinating agent synthesizes all evidence to make the NLI prediction. On two benchmark datasets, our approach significantly enhances NLI robustness against misleading contextual clues and performance consistency compared to standard prompting methods.
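The pipeline described in this abstract can be pictured as several evidence-gathering agents feeding one coordinator. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the `Evidence` fields, the two stub agents, and the `coordinate` function are hypothetical names, and a plain weight tally stands in for the LLM-based coordinating agent.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Evidence:
    category: str  # kind of linguistic evidence, e.g. "syntax" or "lexis"
    l1: str        # candidate native language the evidence points to
    weight: float  # placeholder strength assigned by the agent


# An agent maps a text to a list of evidence items (hypothetical interface).
Agent = Callable[[str], List[Evidence]]


def syntax_agent(text: str) -> List[Evidence]:
    # Stub: a real agent would query an LLM about syntactic transfer patterns.
    return [Evidence("syntax", "Turkish", 0.6)]


def lexis_agent(text: str) -> List[Evidence]:
    # Stub: a real agent would query an LLM about word choice and collocations.
    return [Evidence("lexis", "Turkish", 0.3), Evidence("lexis", "German", 0.5)]


def coordinate(text: str, agents: List[Agent]) -> str:
    """Collect evidence from every agent, then decide once at the end.

    Assumption: summing weights per candidate L1 replaces the
    goal-aware coordinating agent described in the abstract.
    """
    scores: Dict[str, float] = defaultdict(float)
    for agent in agents:
        for ev in agent(text):
            scores[ev.l1] += ev.weight
    return max(scores, key=scores.get) if scores else "unknown"
```

The point of the structure is that no single agent makes the final call: each one only reports categorized evidence, and the prediction is made in a separate step over the pooled evidence.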

artificial intelligence, large language model, natural language, (15 more...)

2509.16666

Country: Europe (1.00)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.94)

Leveraging Open-Source Large Language Models for Native Language Identification

Ng, Yee Man, Markov, Ilia

arXiv.org Artificial Intelligence, Sep-15-2024

Native Language Identification (NLI), the task of identifying the native language (L1) of a person based on their writing in a second language (L2), has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that rely heavily on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and the undisclosed nature of their training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.

computational linguistic, llm, open-source llm, (12 more...)

2409.09659

Country:

Europe > Netherlands > North Holland > Amsterdam (0.05)
North America > Mexico > Mexico City > Mexico City (0.05)
North America > United States > New York > New York County > New York City (0.04)
(14 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Native Language Identification with Large Language Models

Zhang, Wei, Salle, Alexandre

arXiv.org Artificial Intelligence, Dec-12-2023

We present the first experiments on Native Language Identification (NLI) using LLMs such as GPT-4. NLI is the task of predicting a writer's first language by analyzing their writings in a second language, and is used in second language acquisition and forensic linguistics. Our results show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting. We also show that unlike previous fully-supervised settings, LLMs can perform NLI without being limited to a set of known classes, which has practical implications for real-world applications. Finally, we also show that LLMs can provide justification for their choices, providing reasoning based on spelling errors, syntactic patterns, and usage of directly translated linguistic patterns.
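The difference between the closed-set and open-set zero-shot settings mentioned in this abstract comes down to whether the prompt restricts the answer to a known label list. The sketch below is a hedged illustration only: the function name `build_nli_prompt` and the prompt wording are assumptions, not the prompts used in the paper, and no model call is made. The label tuple lists the eleven L1s of the TOEFL11 corpus.

```python
from typing import Optional, Sequence

# The eleven native languages covered by the TOEFL11 benchmark.
TOEFL11_L1S = (
    "Arabic", "Chinese", "French", "German", "Hindi", "Italian",
    "Japanese", "Korean", "Spanish", "Telugu", "Turkish",
)


def build_nli_prompt(text: str, labels: Optional[Sequence[str]] = None) -> str:
    """Build a zero-shot NLI prompt (hypothetical wording).

    With `labels`, the model is asked to choose from a closed set;
    without them, any language is an acceptable answer (open set).
    """
    task = ("Identify the native language (L1) of the author of the "
            "following English text.")
    if labels:
        constraint = "Answer with exactly one of: " + ", ".join(labels) + "."
    else:
        constraint = ("Answer with the single most likely language; "
                      "any language is allowed.")
    return f"{task}\n{constraint}\n\nText:\n{text}\n\nNative language:"
```

Calling `build_nli_prompt(essay, TOEFL11_L1S)` gives the closed-set variant, while `build_nli_prompt(essay)` gives the open-set one; the returned string would then be sent to whichever LLM is under evaluation.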

computational linguistic, experiment, gpt-4, (13 more...)

2312.07819

Country:

North America > United States > Colorado > Denver County > Denver (0.05)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Turkish Native Language Identification

Uluslu, Ahmet Yavuz, Schneider, Gerold

arXiv.org Artificial Intelligence, Nov-4-2023

In this paper, we present the first application of Native Language Identification (NLI) for the Turkish language. NLI involves predicting the writer's first language by analysing their writing in different languages. While most NLI research has focused on English, our study extends its scope to Turkish. We used the recently constructed Turkish Learner Corpus and employed a combination of three syntactic features (CFG production rules, part-of-speech n-grams, and function words) with L2 texts to demonstrate their effectiveness in this task.
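Two of the three feature types named in this abstract, part-of-speech n-grams and function words, are simple enough to sketch with the standard library. This is an illustrative sketch, not the authors' feature extractor: the function names are hypothetical, the POS tags are assumed to come from an external tagger, and the function-word list is supplied by the caller. CFG production rules are omitted because they require a syntactic parser.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def pos_ngram_counts(tags: Sequence[str], n: int) -> Counter:
    """Count contiguous n-grams over a sequence of POS tags."""
    grams: Iterable[Tuple[str, ...]] = (
        tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)
    )
    return Counter(grams)


def function_word_freqs(tokens: Sequence[str],
                        function_words: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each listed function word among all tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {w: counts[w] / total for w in function_words}
```

Both functions return sparse count or frequency maps that could be vectorized and passed to a conventional classifier; because they ignore content words, such features are less tied to essay topic than word n-grams.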

identification, language identification, native language identification, (12 more...)

2307.1485

Country:

Europe > Switzerland > Zürich > Zürich (0.05)
South America > Brazil (0.04)
Asia > Middle East > Republic of Türkiye (0.04)
Asia > Afghanistan (0.04)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.35)

Scaling Native Language Identification with Transformer Adapters

Uluslu, Ahmet Yavuz, Schneider, Gerold

arXiv.org Artificial Intelligence, Nov-18-2022

Native language identification (NLI) is the task of automatically identifying the native language (L1) of an individual based on their language production in a learned language. It is useful for a variety of purposes including marketing, security and educational applications. NLI is usually framed as a multi-class classification task, where numerous designed features are combined to achieve state-of-the-art results. Recently, a deep generative approach based on transformer decoders (GPT-2) outperformed its counterparts and achieved the best results on the NLI benchmark datasets. We investigate this approach to determine its practical implications compared to traditional state-of-the-art NLI systems. We introduce transformer adapters to address memory limitations and improve training/inference speed to scale NLI applications for production.

artificial intelligence, machine learning, natural language, (15 more...)

2211.10117

Country:

Europe > Switzerland > Zürich > Zürich (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Unravelling Interlanguage Facts via Explainable Machine Learning

Berti, Barbara, Esuli, Andrea, Sebastiani, Fabrizio

arXiv.org Artificial Intelligence, Aug-2-2022

Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i.e., that of analysing the internals of an NLI classifier trained by an explainable machine learning algorithm, in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena "give a speaker's native language away". We use this perspective in order to tackle both NLI and a (much less researched) companion task, i.e., guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners' essays and a dataset of social media posts), we investigate which kind of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, namely, are most indicative of a speaker's L1. We also present two case studies, one on Spanish and one on Italian learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s. Overall, our study shows that the use of explainable machine learning can be a valuable tool for th

classifier, dataset, learner, (15 more...)

2208.01468

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)
North America > United States > Michigan (0.04)
(18 more...)

Genre: Research Report (0.81)

Industry:

Education > Curriculum > Subject-Specific Education (0.67)
Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.68)
(2 more...)

Topics to Avoid: Demoting Latent Confounds in Text Classification

Kumar, Sachin, Wintner, Shuly, Smith, Noah A., Tsvetkov, Yulia

arXiv.org Machine Learning, Sep-1-2019

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author's native language is Swedish). We propose a method that represents the latent topical confounds and a model which "unlearns" confounding features by predicting both the label of the input text and the confound. The two predictors are trained adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.

confound, machine learning, natural language, (18 more...)

1909.00453

Country:

North America > United States (0.68)
Europe > Sweden (0.48)
Europe > United Kingdom (0.46)
Asia > Middle East > Israel (0.14)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Native Language Identification using i-vector

Uddin, Ahmed Nazim, Rahman, Md Ashequr, Islam, Md. Rafidul, Haque, Mohammad Ariful

arXiv.org Machine Learning, Nov-9-2018

The task of determining a speaker's native language based only on their speech in a second language is known as Native Language Identification (NLI). Due to its increasing applications in various domains of speech signal processing, this has emerged as an important research area in recent times. In this paper, we propose an i-vector based approach to develop an automatic NLI system using MFCC and GFCC features. To evaluate our approach, we tested our framework on the 2016 ComParE Native Language Sub-Challenge dataset, which has English language speakers from 11 different native language backgrounds. Our proposed method outperforms the baseline system with an improvement in accuracy of 21.95% for the MFCC feature based i-vector framework and 22.81% for the GFCC feature based i-vector framework.

artificial intelligence, i-vector framework, machine learning, (15 more...)

1811.0554

Country:

Asia > Bangladesh (0.15)
Europe > Spain (0.14)
Europe > Italy (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
