AITopics | Kohima

Kohima

Part-of-speech tagging for Nagamese Language using CRF

Shohe, Alovi N, Khiamungam, Chonglio, Angami, Teisovi

arXiv.org Artificial Intelligence | Oct-14-2025

This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified creole language that developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work on part-of-speech tagging has been done for resource-rich languages such as English and Hindi. However, no such work has been done for the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese language. The aim of this work is to identify the part-of-speech tags for a given sentence in the Nagamese language. An annotated corpus of 16,112 tokens is created, and a machine learning technique known as Conditional Random Fields (CRF) is applied to it. Using CRF, an overall tagging accuracy of 85.70% is achieved, with precision and recall of 86% and an F1-score of 85%. Keywords: Nagamese, NLP, part-of-speech, machine learning, CRF.
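To make the CRF setup in this abstract more concrete, here is a minimal sketch of the per-token feature dictionaries that CRF toolkits typically consume, plus a token-level accuracy helper. The feature set, the function names, and the English toy tokens are all illustrative assumptions; the abstract does not describe the features the authors used.

```python
from typing import Dict, List, Union

Feature = Union[str, bool]


def token_features(sentence: List[str], i: int) -> Dict[str, Feature]:
    """Build a feature dict for the token at position i.

    Hypothetical feature set for illustration only; not taken from the paper.
    """
    word = sentence[i]
    features: Dict[str, Feature] = {
        "word.lower": word.lower(),
        "prefix2": word[:2],
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
    }
    # Context features: neighbouring words, or sentence-boundary markers.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(sentence: List[str]) -> List[Dict[str, Feature]]:
    """Feature dicts for every token of one sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


def token_accuracy(gold: List[List[str]], predicted: List[List[str]]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [
        (g, p)
        for gold_tags, pred_tags in zip(gold, predicted)
        for g, p in zip(gold_tags, pred_tags)
    ]
    if not pairs:
        raise ValueError("no tokens to score")
    return sum(g == p for g, p in pairs) / len(pairs)
```

A list of such per-sentence feature lists, paired with one tag sequence per sentence, is the usual training input for a linear-chain CRF; the tagger itself is deliberately left out of this sketch.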

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.19343

Country: Asia > India > Nagaland > Kohima (0.04)

Genre:

Research Report (0.90)
Overview (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Tenyidie Syllabification corpus creation and deep learning applications

Angami, Teisovi, Khate, Kevisino

arXiv.org Artificial Intelligence | Oct-3-2025

The Tenyidie language is a low-resource language of the Tibeto-Burman family spoken by the Tenyimia community of Nagaland in the north-eastern part of India and is considered a major language in Nagaland. It is tonal, Subject-Object-Verb, and highly agglutinative in nature. Because it is a low-resource language, very limited research on Natural Language Processing (NLP) has been conducted for it. To the best of our knowledge, no work on syllabification has been reported for this language. Among the many NLP tasks, syllabification, or syllabication, is an important task in which the syllables of a given word are identified. The contribution of this work is the creation of a corpus of 10,120 syllabified Tenyidie words and the application of deep learning techniques to it. In this paper, we apply LSTM, BLSTM, BLSTM+CRF, and encoder-decoder deep learning architectures to our dataset. With an 80:10:10 (train:validation:test) split, we achieve the highest accuracy of 99.21% on the test set with the BLSTM model. This work will find application in numerous other NLP tasks for the Tenyidie language, such as morphological analysis, part-of-speech tagging, and machine translation.
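One common way to hand syllabification to sequence models such as LSTM or BLSTM+CRF is to label each character as the beginning (B) or inside (I) of a syllable. The sketch below assumes that framing; the abstract does not say which encoding the authors used. The hyphen separator, the function names, and the placeholder strings are hypothetical, and the last helper simply works out the 80:10:10 split sizes for 10,120 words.

```python
from typing import List, Tuple


def to_boundary_labels(syllabified: str, sep: str = "-") -> List[str]:
    """Map a separator-delimited word to one B/I label per character."""
    labels: List[str] = []
    for syllable in syllabified.split(sep):
        if not syllable:
            continue  # ignore empty pieces from doubled separators
        labels.append("B")
        labels.extend(["I"] * (len(syllable) - 1))
    return labels


def from_boundary_labels(word: str, labels: List[str], sep: str = "-") -> str:
    """Rebuild the syllabified form from a plain word and its B/I labels."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    pieces: List[str] = []
    for char, label in zip(word, labels):
        if label == "B" and pieces:
            pieces.append(sep)  # a new syllable starts here
        pieces.append(char)
    return "".join(pieces)


def split_sizes(total: int, train_pct: int = 80, val_pct: int = 10) -> Tuple[int, int, int]:
    """Train/validation/test sizes; the test set takes the remainder."""
    train = total * train_pct // 100
    val = total * val_pct // 100
    return train, val, total - train - val
```

For example, the placeholder string "ab-cde" (not a Tenyidie word) becomes the labels B I B I I, and `split_sizes(10120)` gives 8,096 training, 1,012 validation, and 1,012 test words.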

artificial intelligence, deep learning, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2510.00629

Country:

North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > India > Nagaland > Kohima (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models

Gogoi, Parismita, Kalita, Sishir, Lalhminghlui, Wendy, Terhiija, Viyazonuo, Tzudir, Moakala, Sarmah, Priyankoo, Prasanna, S. R. M.

arXiv.org Artificial Intelligence | Jun-5-2025

This study explores the use of self-supervised learning (SSL) models for tone recognition in three low-resource languages from North Eastern India: Angami, Ao, and Mizo. We evaluate four Wav2vec2.0 base models that were pre-trained on both tonal and non-tonal languages. We analyze tone-wise performance across the layers for all three languages and compare the different models. Our results show that tone recognition works best for Mizo and worst for Angami. The middle layers of the SSL models are the most important for tone recognition, regardless of the pre-training language, i.e. tonal or non-tonal. We have also found that the tone inventory, tone types, and dialectal variations affect tone recognition. These findings provide useful insights into the strengths and weaknesses of SSL-based embeddings for tonal languages and highlight the potential for improving tone recognition in low-resource settings. The source code is available on GitHub.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2506.03606

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > India > Nagaland > Kohima (0.04)
North America > Canada > Ontario > Toronto (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

Learning to Plan for Language Modeling from Unlabeled Data

Cornille, Nathan, Moens, Marie-Francine, Mai, Florian

arXiv.org Artificial Intelligence | Mar-31-2024

By training to predict the next token in an unlabeled corpus, large language models learn to perform many tasks without any labeled data. However, their next-token-prediction objective arguably limits their performance in scenarios that require planning, such as writing a coherent article. In this paper, we train a module for planning the future writing process via a self-supervised learning objective. By conditioning on generated latent plans, our model extends the successful language model formula to more abstract planning in an unsupervised way. Empirically, we demonstrate that our method improves language modeling performance in general, particularly with respect to the text structure. Because our framework uses a planner module that is unsupervised and external to the language model, new planner modules can be trained at large scale and easily be shared with the community.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2404.00614

Country:

Africa > Ethiopia (0.14)
North America > United States > Pennsylvania > Berks County > Reading (0.04)
Europe > Italy (0.04)
(19 more...)

Genre: Research Report (0.50)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Detecting Pretraining Data from Large Language Models

Shi, Weijia, Ajith, Anirudh, Xia, Mengzhou, Huang, Yangsibo, Liu, Daogao, Blevins, Terra, Chen, Danqi, Zettlemoyer, Luke

arXiv.org Artificial Intelligence | Nov-3-2023

Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark, WIKIMIA, that uses data created before and after model training to support gold truth detection. We also introduce a new detection method, Min-K% Prob, based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios (copyrighted book detection, contaminated downstream example detection, and privacy auditing of machine unlearning) and find it a consistently effective solution.
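The hypothesis behind Min-K% Prob can be sketched in a few lines: take the per-token log-probabilities the LLM assigns to a text, keep the k% lowest, and average them. The sketch below assumes those log-probabilities have already been obtained from the model. The function names, the rounding of the selected-token count, and the decision threshold are illustrative assumptions rather than details from the abstract.

```python
from typing import Sequence


def min_k_prob(token_log_probs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of least likely tokens.

    token_log_probs: log p(token | prefix) for each token, assumed to be
    computed beforehand with the LLM under test.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    # Assumption: round down, but always keep at least one token.
    n_selected = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n_selected]
    return sum(lowest) / n_selected


def looks_seen(token_log_probs: Sequence[float], k: float = 0.2,
               threshold: float = -3.0) -> bool:
    """Flag a text as likely seen in pretraining when its score is high.

    The threshold is a hypothetical value; in practice it would have to be
    calibrated on texts with known membership.
    """
    return min_k_prob(token_log_probs, k) > threshold
```

A text with a few very unlikely tokens gets a low score and is treated as unseen, while a text whose least likely tokens are still fairly probable scores higher and is flagged as likely seen.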

arxiv preprint arxiv, harry potter, language model, (12 more...)

arXiv.org Artificial Intelligence

2310.16789

Country:

North America > United States > California (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Best of the web: Artificial Intelligence news for October 22, 2016

#artificialintelligence | Oct-22-2016, 23:20:57 GMT

With Stephen Hawking opening an AI lab, it's only a matter of time before smart robots take over for humans in the factory, on the battlefield, in the supermarket, and behind the counter. There's an old Chinese saying: "If you want to do anything good, easy and fast, you need connections," said Nancy Yang, a spokesperson for the fourth annual Seattle Biz-Tech Summit meeting today in Bellevue, Wash., outside Seattle. "Here in the U.S., we use email and messaging, but the Chinese way, and really for many Asians, is to meet face to face." Tanvi Lad shook off her first game loss and pulled off a rare victory over Rituparna Das in a three-set match and entered the women's singles final of the Manorama-Indian Open National-ranking badminton tournament here on Saturday. Stephen Hawking, the famous scientist who once said intelligent machines could be mankind's biggest threat, opened an artificial intelligence lab in Britain this week to help develop robot surgeons and Terminator-style military droids.

artificial intelligence news, natural language, university, (6 more...)

#artificialintelligence

Country:

North America > United States > Washington > King County > Bellevue (0.38)
Asia > India > Nagaland > Kohima (0.09)
North America > United States > New York (0.06)
(7 more...)

Industry: Leisure & Entertainment > Sports (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (0.61)
Information Technology > Artificial Intelligence > Natural Language (0.42)
