AITopics | tagalog

HiligayNER: A Baseline Named Entity Recognition Model for Hiligaynon

Teves, James Ald, Cal, Ray Daniel, Villaluz, Josh Magdiel, Malolos, Jean, Magtira, Mico, Rodriguez, Ramon, Abisado, Mideth, Imperial, Joseph Marvin

arXiv.org Artificial Intelligence, Oct-14-2025

The language of Hiligaynon, spoken predominantly by the people of Panay Island, Negros Occidental, and Soccsksargen in the Philippines, remains underrepresented in language processing research due to the absence of annotated corpora and baseline models. This study introduces HiligayNER, the first publicly available baseline model for the task of Named Entity Recognition (NER) in Hiligaynon. The dataset used to build HiligayNER contains over 8,000 annotated sentences collected from publicly available news articles, social media posts, and literary texts. Two Transformer-based models, mBERT and XLM-RoBERTa, were fine-tuned on this collected corpus to build versions of HiligayNER. Evaluation results show strong performance, with both models achieving over 80% in precision, recall, and F1-score across entity types. Furthermore, cross-lingual evaluation with Cebuano and Tagalog demonstrates promising transferability, suggesting the broader applicability of HiligayNER for multilingual NLP in low-resource settings. This work aims to contribute to language technology development for underrepresented Philippine languages, specifically for Hiligaynon, and support future research in regional language processing.
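
The precision, recall, and F1-score reported above are standard NER metrics. As a minimal sketch, assuming entity-level exact matching on (start, end, type) spans, they can be computed as below. The function name `prf1` and the toy spans are hypothetical and are not taken from the HiligayNER evaluation code.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall, and F1 with exact span matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (start_token, end_token, entity_type).
gold_spans = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
pred_spans = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG"), (12, 13, "LOC")}

precision, recall, f1 = prf1(gold_spans, pred_spans)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

A span with the right boundaries but the wrong type counts as both a false positive and a false negative under this matching rule, which is one common convention among several.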

computational linguistic, information retrieval, machine learning, (18 more...)

arXiv: 2510.10776

Country:

North America > United States > Minnesota (0.28)
Asia > Philippines > Visayas > Negros Island Region > Province of Negros Occidental (0.24)
Asia > Philippines > Mindanao > Soccsksargen (0.24)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project

Aquino, Angelina A., Miranda, Lester James V., Or, Elsie Marie T.

arXiv.org Artificial Intelligence, May-28-2025

This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss their implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.

computational linguistic, large language model, machine learning, (19 more...)

arXiv: 2505.20428

Country:

Europe (1.00)
Asia > Middle East (0.67)
North America > United States > Minnesota (0.28)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.93)
Leisure & Entertainment > Sports > Basketball (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Extracting General-use Transformers for Low-resource Languages via Knowledge Distillation

Cruz, Jan Christian Blaise, Aji, Alham Fikri

arXiv.org Artificial Intelligence, Jan-22-2025

In this paper, we propose the use of simple knowledge distillation to produce smaller and more efficient single-language transformers from Massively Multilingual Transformers (MMTs) to alleviate tradeoffs associated with the use of such models in low-resource settings. Using Tagalog as a case study, we show that these smaller single-language models perform on par with strong baselines in a variety of benchmark tasks in a much more efficient manner. Furthermore, we investigate additional steps during the distillation process that improve the soft-supervision of the target language, and provide a number of analyses and ablations to show the efficacy of the proposed method.
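
To make the idea of soft supervision concrete, the sketch below computes a commonly used distillation objective: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. This is an assumption chosen for illustration. The abstract does not state the exact loss the authors use, and the function names and logits here are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(teacher, student)
print(f"soft-target loss: {loss:.4f}")
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative preferences among lower-probability tokens rather than only from its top prediction.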

distillation, machine learning, natural language, (17 more...)

arXiv: 2501.1266

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Washington > King County > Seattle (0.05)
North America > Canada > Ontario > Toronto (0.05)
(3 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

calamanCy: A Tagalog Natural Language Processing Toolkit

Miranda, Lester James V.

arXiv.org Artificial Intelligence, Nov-13-2023

We introduce calamanCy, an open-source toolkit for constructing natural language processing (NLP) pipelines for Tagalog. It is built on top of spaCy, enabling easy experimentation and integration with other frameworks. calamanCy addresses the development gap by providing a consistent API for building NLP applications and offering general-purpose multitask models with out-of-the-box support for dependency parsing, part-of-speech (POS) tagging, and named entity recognition (NER). calamanCy aims to accelerate the progress of Tagalog NLP by consolidating disjointed resources in a unified framework. The calamanCy toolkit is available on GitHub: https://github.com/ljvmiranda921/calamanCy.

calamancy, computational linguistic, pipeline, (13 more...)

arXiv: 2311.07171

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.05)
(5 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.90)

Developing a Named Entity Recognition Dataset for Tagalog

Miranda, Lester James V.

arXiv.org Artificial Intelligence, Nov-13-2023

We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap present in Philippine languages today, where NER resources are scarce. The texts were obtained from a pretraining corpus containing news reports, and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents across three entity types: Person, Organization, and Location. The inter-annotator agreement, as measured by Cohen's kappa, is 0.81. We also conducted extensive empirical evaluation of state-of-the-art methods across supervised and transfer learning settings. Finally, we released the data and processing code publicly to inspire future work on Tagalog NLP.
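
For readers unfamiliar with the agreement statistic cited above, here is a minimal sketch of Cohen's kappa for two annotators, using the standard definition (observed agreement corrected for chance agreement). The toy labels are hypothetical and are not drawn from the dataset, and the abstract does not say at what granularity the authors computed agreement.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical token-level annotations from two annotators.
annotator_1 = ["PER", "ORG", "LOC", "O", "O", "PER", "LOC", "O"]
annotator_2 = ["PER", "ORG", "O", "O", "O", "PER", "LOC", "ORG"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Raw agreement in this toy example is 0.75, but kappa is lower because some of that agreement would be expected by chance given how often each annotator uses each label.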

computational linguistic, tagalog, tlunified-ner, (13 more...)

arXiv: 2311.07161

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Automatic Readability Assessment for Closely Related Languages

Imperial, Joseph Marvin, Kochmar, Ekaterina

arXiv.org Artificial Intelligence, May-25-2023

In recent years, the main focus of research on automatic readability assessment (ARA) has shifted towards using expensive deep learning-based methods with the primary goal of increasing models' accuracy. This, however, is rarely applicable for low-resource languages where traditional handcrafted features are still widely used due to the lack of existing NLP tools to extract deeper linguistic representations. In this work, we take a step back from the technical component and focus on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting. We collect short stories written in three languages in the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models and explore the interaction of data and features in various cross-lingual setups. Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models compared to the use of off-the-shelf large multilingual language models alone. Consequently, when both linguistic representations are combined, we achieve state-of-the-art results for Tagalog and Cebuano, and baseline scores for ARA in Bikol.
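
As a rough illustration of what an n-gram overlap signal between related languages can look like, the sketch below computes the Jaccard overlap of character trigrams for two strings. This is an assumption made for illustration and is not the authors' definition of CrossNGO, which the abstract does not spell out. The function names and toy strings are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams after lowercasing and whitespace cleanup."""
    cleaned = " ".join(text.lower().split())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def ngram_overlap(text_a, text_b, n=3):
    """Jaccard overlap of the two texts' character n-gram sets."""
    grams_a, grams_b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Toy strings for illustration only.
text_1 = "ang bata ay kumain"
text_2 = "ang bata mikaon"
score = ngram_overlap(text_1, text_2)
print(f"trigram overlap: {score:.3f}")
```

A score like this could be appended to a handcrafted feature vector alongside other readability features. Whether it helps depends on how closely related the languages are, which is the question the paper studies.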

artificial intelligence, machine learning, natural language, (17 more...)

arXiv: 2305.13478

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(14 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Benchmarking zero-shot and few-shot approaches for tokenization, tagging, and dependency parsing of Tagalog text

Aquino, Angelina, de Leon, Franz

arXiv.org Artificial Intelligence, Jan-5-2023

The grammatical analysis of texts in any written language typically involves a number of basic processing tasks, such as tokenization, morphological tagging, and dependency parsing. State-of-the-art systems can achieve high accuracy on these tasks for languages with large datasets, but yield poor results for languages which have little to no annotated data. To address this issue for the Tagalog language, we investigate the use of alternative language resources for creating task-specific models in the absence of dependency-annotated Tagalog data. We also explore the use of word embeddings and data augmentation to improve performance when only a small amount of annotated Tagalog data is available. We show that these zero-shot and few-shot approaches yield substantial improvements on grammatical analysis of both in-domain and out-of-domain Tagalog text compared to state-of-the-art supervised baselines.

artificial intelligence, natural language, pipeline, (18 more...)

arXiv: 2208.01814

Country:

Asia > Philippines > Luzon > National Capital Region > City of Manila (0.15)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(16 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Back Translation Survey for Improving Text Augmentation

Ciolino, Matthew, Noever, David, Kalin, Josh

arXiv.org Artificial Intelligence, Nov-16-2022

Natural Language Processing (NLP) relies heavily on training data. Transformers, as they have gotten bigger, have required massive amounts of training data. To satisfy this requirement, text augmentation should be looked at as a way to expand your current dataset and to generalize your models. One text augmentation we will look at is translation augmentation. We take an English sentence and translate it to another language before translating it back to English. In this paper, we look at the effect of 108 different language back translations on various metrics and text embeddings.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv: 2102.09708

Country:

Asia > Myanmar (0.05)
South America > Brazil (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(3 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Detecting Social Media Manipulation in Low-Resource Languages

Haider, Samar, Luceri, Luca, Deb, Ashok, Badawy, Adam, Peng, Nanyun, Ferrara, Emilio

arXiv.org Artificial Intelligence, Nov-10-2020

Social media have been deliberately used for malicious purposes, including political manipulation and disinformation. Most research focuses on high-resource languages. However, malicious actors share content across countries and languages, including low-resource ones. Here, we investigate whether and to what extent malicious actors can be detected in low-resource language settings. We discovered that a high number of accounts posting in Tagalog were suspended as part of Twitter's crackdown on interference operations after the 2016 US Presidential election. By combining text embedding and transfer learning, our framework can detect, with promising accuracy, malicious users posting in Tagalog without any prior knowledge or training on malicious content in that language. We first learn an embedding model for each language, namely a high-resource language (English) and a low-resource one (Tagalog), independently. Then, we learn a mapping between the two latent spaces to transfer the detection model. We demonstrate that the proposed approach significantly outperforms state-of-the-art models, including BERT, and yields marked advantages in settings with very limited training data, which is the norm when detecting malicious activity on online platforms.

ferrara, low-resource language, social media, (16 more...)

arXiv: 2011.05367

Country:

North America > United States > California (0.14)
Asia > Philippines (0.04)
South America > Venezuela (0.04)
(6 more...)

Genre: Research Report > Promising Solution (0.34)

Industry:

Media (1.00)
Information Technology > Services (1.00)
Government > Voting & Elections (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
