Goto

Collaborating Authors

 Mindanao


HiligayNER: A Baseline Named Entity Recognition Model for Hiligaynon

arXiv.org Artificial Intelligence

The language of Hiligaynon, spoken predominantly by the people of Panay Island, Negros Occidental, and Soccsksargen in the Philippines, remains underrepresented in language processing research due to the absence of annotated corpora and baseline models. This study introduces HiligayNER, the first publicly available baseline model for the task of Named Entity Recognition (NER) in Hiligaynon. The dataset used to build HiligayNER contains over 8,000 annotated sentences collected from publicly available news articles, social media posts, and literary texts. Two Transformer-based models, mBERT and XLM-RoBERTa, were fine-tuned on this collected corpus to build versions of HiligayNER. Evaluation results show strong performance, with both models achieving over 80% in precision, recall, and F1-score across entity types. Furthermore, cross-lingual evaluation with Cebuano and Tagalog demonstrates promising transferability, suggesting the broader applicability of HiligayNER for multilingual NLP in low-resource settings. This work aims to contribute to language technology development for underrepresented Philippine languages, specifically for Hiligaynon, and support future research in regional language processing.


Assessing the Dynamics of the Coffee Value Chain in Davao del Sur: An Agent-Based Modeling Approach

arXiv.org Artificial Intelligence

The study investigates the coffee value chain dynamics in Davao del Sur using an agent-based model. Three main factors driving interactions among key players were identified: trust, risk, and transaction costs. The model was constructed using NetLogo 6.3.0, and data from a survey questionnaire collected three data points from BACOFA members. Five cases were explored, with each scenario simulated 1000 times. Findings suggest that producers often sell to the market rather than the cooperative due to higher prices. However, producers tend to prioritize trust in buyers and their risk attitude, leading to increased sales to the cooperative. The producer's risk attitude significantly influences their decision-making, affecting performance outcomes such as loans, demand, and price changes. All three factors play a role and exert varying impacts on the value chain. So, the stakeholders' decisions on prioritizing factors in improving relationships depend on their priorities. Nonetheless, simulations show that establishing a harmonious system benefiting all parties is possible. However, achieving this requires adjustments to demand, pricing, trust, and risk attitudes of key players, which may not align with the preferences of some parties in reality.


Bounds in Wasserstein Distance for Locally Stationary Functional Time Series

arXiv.org Machine Learning

Functional time series (FTS) extend traditional methodologies to accommodate data observed as functions/curves. A significant challenge in FTS consists of accurately capturing the time-dependence structure, especially with the presence of time-varying covariates. When analyzing time series with time-varying statistical properties, locally stationary time series (LSTS) provide a robust framework that allows smooth changes in mean and variance over time. This work investigates Nadaraya-Watson (NW) estimation procedure for the conditional distribution of locally stationary functional time series (LSFTS), where the covariates reside in a semi-metric space endowed with a semi-metric. Under small ball probability and mixing condition, we establish convergence rates of NW estimator for LSFTS with respect to Wasserstein distance. The finite-sample performances of the model and the estimation method are illustrated through extensive numerical experiments both on functional simulated and real data.


Bounds in Wasserstein distance for locally stationary processes

arXiv.org Machine Learning

Locally stationary processes (LSPs) provide a robust framework for modeling time-varying phenomena, allowing for smooth variations in statistical properties such as mean and variance over time. In this paper, we address the estimation of the conditional probability distribution of LSPs using Nadaraya-Watson (NW) type estimators. The NW estimator approximates the conditional distribution of a target variable given covariates through kernel smoothing techniques. We establish the convergence rate of the NW conditional probability estimator for LSPs in the univariate setting under the Wasserstein distance and extend this analysis to the multivariate case using the sliced Wasserstein distance. Theoretical results are supported by numerical experiments on both synthetic and real-world datasets, demonstrating the practical usefulness of the proposed estimators.


Evaluating Algorithmic Bias in Models for Predicting Academic Performance of Filipino Students

arXiv.org Artificial Intelligence

Algorithmic bias is a major issue in machine learning models in educational contexts. However, it has not yet been studied thoroughly in Asian learning contexts, and only limited work has considered algorithmic bias based on regional (sub-national) background. As a step towards addressing this gap, this paper examines the population of 5,986 students at a large university in the Philippines, investigating algorithmic bias based on students' regional background. The university used the Canvas learning management system (LMS) in its online courses across a broad range of domains. Over the period of three semesters, we collected 48.7 million log records of the students' activity in Canvas. We used these logs to train binary classification models that predict student grades from the LMS activity. The best-performing model reached AUC of 0.75 and weighted F1-score of 0.79. Subsequently, we examined the data for bias based on students' region. Evaluation using three metrics: AUC, weighted F1-score, and MADD showed consistent results across all demographic groups. Thus, no unfairness was observed against a particular student group in the grade predictions.


SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

arXiv.org Artificial Intelligence

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.


A quantitative and typological study of Early Slavic participle clauses and their competition

arXiv.org Artificial Intelligence

This thesis is a corpus-based, quantitative, and typological analysis of the functions of Early Slavic participle constructions and their finite competitors ($jegda$-'when'-clauses). The first part leverages detailed linguistic annotation on Early Slavic corpora at the morphosyntactic, dependency, information-structural, and lexical levels to obtain indirect evidence for different potential functions of participle clauses and their main finite competitor and understand the roles of compositionality and default discourse reasoning as explanations for the distribution of participle constructions and $jegda$-clauses in the corpus. The second part uses massively parallel data to analyze typological variation in how languages express the semantic space of English $when$, whose scope encompasses that of Early Slavic participle constructions and $jegda$-clauses. Probabilistic semantic maps are generated and statistical methods (including Kriging, Gaussian Mixture Modelling, precision and recall analysis) are used to induce cross-linguistically salient dimensions from the parallel corpus and to study conceptual variation within the semantic space of the hypothetical concept WHEN.


Calibrating Long-form Generations from Large Language Models

arXiv.org Artificial Intelligence

To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where an answer can be partially correct. Addressing this gap, we introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores. Within this framework, we develop three metrics to precisely evaluate LLM calibration and further propose two confidence elicitation methods based on self-consistency and self-evaluation. Our experiments, which include long-form QA and summarization tasks, demonstrate that larger models don't necessarily guarantee better calibration, that calibration performance is found to be metric-dependent, and that self-consistency methods excel in factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. Lastly, we showcase a practical application of our system: selecting and cascading open-source models and ChatGPT to optimize correctness given a limited API budget. This research not only challenges existing notions of LLM calibration but also offers practical methodologies for improving trustworthiness in long-form generation.


Improved Forecasting Using a PSO-RDV Framework to Enhance Artificial Neural Network

arXiv.org Artificial Intelligence

Decision making and planning have long relied heavily on AI-driven forecasts. The government and the general public are working to minimize the risks while maximizing benefits in the face of potential future public health uncertainties. This study used an improved method of forecasting utilizing the Random Descending Velocity Inertia Weight (RDV IW) technique to improve the convergence of Particle Swarm Optimization (PSO) and the accuracy of Artificial Neural Network (ANN). The IW technique, inspired by the motions of a golf ball, modified the particles' velocities as they approached the solution point to a parabolically descending structure. Simulation results revealed that the proposed forecasting model with [0.4, 0.9] combination of alpha and alpha_dump exhibits a 6.36% improvement in position error and 11.75% improvement in computational time compared to the old model, thus, improving its convergence. It reached the optimum level at minimal steps with 12.50% improvement as against the old model since it provides better velocity averages when speed stabilization occurs at the 24th iteration. Meanwhile, the computed p-values for NRMSE (0.04889174), MAE (0.02829063), MAPE (0.02226053), WAPE (0.01701545), and R2 (0.00000021) of the proposed algorithm are less than the set 0.05 level of significance, thus the values indicated a significant result in terms of accuracy performance. Applying the modified ANN-PSO using RDV IW technique greatly improved the new HIV/AIDS forecasting model compared with the two models.


GlotLID: Language Identification for Low-Resource Languages

arXiv.org Artificial Intelligence

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model, code, and list of data sources are available: https://github.com/cisnlp/GlotLID.