Goto

Collaborating Authors

 Bucharest


The Cognate Data Bottleneck in Language Phylogenetics

Häuser, Luise, Stamatakis, Alexandros

arXiv.org Artificial Intelligence

To fully exploit the potential of computational phylogenetic methods for cognate data one needs to leverage specific (complex) models an machine learning-based techniques. However, both approaches require datasets that are substantially larger than the manually collected cognate data currently available. To the best of our knowledge, there exists no feasible approach to automatically generate larger cognate datasets. We substantiate this claim by automatically extracting datasets from BabelNet, a large multilingual encyclopedic dictionary. We demonstrate that phylogenetic inferences on the respective character matrices yield trees that are largely inconsistent with the established gold standard ground truth trees. We also discuss why we consider it as being unlikely to be able to extract more suitable character matrices from other multilingual resources. Phylogenetic data analysis approaches that require larger datasets can therefore not be applied to cognate data. Thus, it remains an open question how, and if these computational approaches can be applied in historical linguistics.


Putin confirms he wants all of Ukraine, as Europe steps up military aid

Al Jazeera

Ukraine's European allies pledged increased levels of military aid to Ukraine this year, making up for a United States aid freeze, as Russian President Vladimir Putin reaffirmed his ambition to absorb all of Ukraine into the Russian Federation. "At this moment, the Europeans and the Canadians have pledged, for this year, 35bn in military support to Ukraine," said NATO Secretary-General Mark Rutte ahead of the alliance's annual summit, which took place in The Hague on Tuesday and Wednesday, June 24-25. "Last year, it was just over 50bn for the full year. Now, before we reach half year, it is already at 35bn. And there are even others saying it's already close to 40bn," he added.


ViLLa: A Neuro-Symbolic approach for Animal Monitoring

Koduri, Harsha

arXiv.org Artificial Intelligence

Monitoring animal populations in natural environments requires systems that can interpret both visual data and human language queries. This work introduces ViLLa (Vision-Language-Logic Approach), a neuro-symbolic framework designed for interpretable animal monitoring. ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Given an image and a question such as "How many dogs are in the scene?" or "Where is the buffalo?", the system grounds visual detections into symbolic facts and uses predefined rules to compute accurate answers related to count, presence, and location. Unlike end-to-end black-box models, ViLLa separates perception, understanding, and reasoning, offering modularity and transparency. The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries.


Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet

Augello, Lorenzo, McCrae, John P.

arXiv.org Artificial Intelligence

Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.


Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales

Tsutsumi, Ayuto, Jinnai, Yuu

arXiv.org Artificial Intelligence

Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, evaluation of the cultural awareness of the LLMs and the methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well. The code and dataset are available at https://github.com/CyberAgentA ILab/YokaiEval.


Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era

Oneata, Dan, Elliott, Desmond, Frank, Stella

arXiv.org Artificial Intelligence

Human learning and conceptual representation is grounded in sensorimotor experience, in contrast to state-of-the-art foundation models. In this paper, we investigate how well such large-scale models, trained on vast quantities of data, represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended denser version of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are classified as "encyclopedic" or "function". These results offer new insights into what can be learned from pure unimodal learning, and the complementarity of the modalities.


The mutual exclusivity bias of bilingual visually grounded speech models

Oneata, Dan, Nortje, Leanne, Matusevych, Yevgen, Kamper, Herman

arXiv.org Artificial Intelligence

Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: https://github.com/danoneata/me-vgs


Automatic Correction of Writing Anomalies in Hausa Texts

Wali, Ahmad Mustapha, Nisioi, Sergiu

arXiv.org Artificial Intelligence

Hausa texts are often characterized by writing anomalies such as incorrect character substitutions and spacing errors, which sometimes hinder natural language processing (NLP) applications. This paper presents an approach to automatically correct the anomalies by finetuning transformer-based models. Using a corpus gathered from several public sources, we created a large-scale parallel dataset of over 450,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise, fine-tuned to mimic realistic writing errors. Moreover, we adapted several multilingual and African language-focused models, including M2M100, AfriTEVA, mBART, and Opus-MT variants for this correction task using SentencePiece tokenization. Our experimental results demonstrate significant increases in F1, BLEU and METEOR scores, as well as reductions in Character Error Rate (CER) and Word Error Rate (WER). This research provides a robust methodology, a publicly available dataset, and effective models to improve Hausa text quality, thereby advancing NLP capabilities for the language and offering transferable insights for other low-resource languages.


Redefining Research Crowdsourcing: Incorporating Human Feedback with LLM-Powered Digital Twins

Chan, Amanda, Di, Catherine, Rupertus, Joseph, Smith, Gary, Rao, Varun Nagaraj, Ribeiro, Manoel Horta, Monroy-Hernández, Andrés

arXiv.org Artificial Intelligence

Crowd work platforms like Amazon Mechanical Turk and Prolific are vital for research, yet workers' growing use of generative AI tools poses challenges. Researchers face compromised data validity as AI responses replace authentic human behavior, while workers risk diminished roles as AI automates tasks. To address this, we propose a hybrid framework using digital twins, personalized AI models that emulate workers' behaviors and preferences while keeping humans in the loop. We evaluate our system with an experiment (n=88 crowd workers) and in-depth interviews with crowd workers (n=5) and social science researchers (n=4). Our results suggest that digital twins may enhance productivity and reduce decision fatigue while maintaining response quality. Both researchers and workers emphasized the importance of transparency, ethical data use, and worker agency. By automating repetitive tasks and preserving human engagement for nuanced ones, digital twins may help balance scalability with authenticity.


MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

Croitoru, Florinel-Alin, Hondru, Vlad, Popescu, Marius, Ionescu, Radu Tudor, Khan, Fahad Shahbaz, Shah, Mubarak

arXiv.org Artificial Intelligence

We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.