multilingual
HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection
Feng, Zixin, Cui, Xinying, Sun, Yifan, Wei, Zheng, Yuan, Jiachen, Hu, Jiazhen, Xin, Ning, Hasan, Md Maruf
Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.
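The confidence-based pseudo-labeling step of the self-training strategy can be sketched as follows. This is a minimal illustration, not HMS-BERT's actual implementation; the threshold value and function names are assumptions.

```python
# Hedged sketch of confidence-based pseudo-labeling for iterative
# self-training. The 0.9 threshold and all names are illustrative
# assumptions, not taken from the HMS-BERT paper.

def select_pseudo_labels(predictions, threshold=0.9):
    """Keep unlabeled examples whose top class probability exceeds threshold.

    predictions: list of (example_id, {label: probability}) pairs.
    Returns a list of (example_id, label) pseudo-labels.
    """
    selected = []
    for example_id, probs in predictions:
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            selected.append((example_id, label))
    return selected


def self_training_round(labeled, unlabeled, train_fn, predict_fn, threshold=0.9):
    """One iteration: train, pseudo-label confident examples, merge them in."""
    model = train_fn(labeled)
    predictions = [(ex, predict_fn(model, ex)) for ex in unlabeled]
    pseudo = select_pseudo_labels(predictions, threshold)
    kept = {ex for ex, _ in pseudo}
    return labeled + pseudo, [ex for ex in unlabeled if ex not in kept]
```

Each round grows the labeled pool with only high-confidence predictions, which is what allows the cross-lingual transfer described above without flooding training with noisy labels.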
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly with complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development.
Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case
Kembu, Vignesh Kumar, Morandini, Pierandrea, Ranzini, Marta Bianca Maria, Nocera, Antonino
Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors such as healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Traditional NLP techniques have long been used for this, but they often fall short because of the complexity and variability of clinical language and the rich implicit semantics of free clinical text. LLMs have recently emerged as a powerful tool for understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand Electronic Health Records (EHRs) in Italian and to help extract information from them in real time. Our detailed experimental campaign on comorbidity extraction from EHRs reveals that some LLMs struggle in zero-shot, on-premises settings, while others show significant variation in performance and fail to generalize across diseases when compared to native pattern matching and manual annotations.
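The "native pattern matching" baseline that the LLMs are compared against can be sketched as a lexicon lookup over free-text notes. This is a minimal illustration only; the lexicon entries and function name are assumptions, not the paper's actual resources.

```python
import re

# Minimal sketch of a pattern-matching baseline for comorbidity
# extraction from Italian clinical free text. The lexicon below is an
# illustrative assumption, not the lexicon used in the paper.
COMORBIDITY_LEXICON = {
    "diabetes": [r"\bdiabete\b", r"\bdiabetes\b"],
    "hypertension": [r"\bipertensione\b", r"\bhypertension\b"],
}

def extract_comorbidities(note):
    """Return the set of lexicon diseases mentioned in a clinical note."""
    text = note.lower()
    found = set()
    for disease, patterns in COMORBIDITY_LEXICON.items():
        if any(re.search(p, text) for p in patterns):
            found.add(disease)
    return found
```

Such a baseline is brittle (no negation handling, no spelling variants beyond the lexicon), which is exactly the gap zero-shot LLM extraction is meant to close.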
- Health & Medicine > Health Care Technology > Medical Record (1.00)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.31)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Multilingual Pretraining for Pixel Language Models
Kesen, Ilker, Lotz, Jonas F., Ziegler, Ingo, Rust, Phillip, Elliott, Desmond
Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.
- Europe (1.00)
SHRAG: A Framework for Combining Human-Inspired Search with RAG
Ryu, Hyunseok, Shin, Wonjune, Park, Hyun
Retrieval-Augmented Generation (RAG) is gaining recognition as one of the key technological axes for next-generation information retrieval, owing to its ability to mitigate the hallucination phenomenon in Large Language Models (LLMs) and effectively incorporate up-to-date information. However, specialized expertise is necessary to construct a high-quality retrieval system independently; moreover, RAG demonstrates relatively slower processing speeds compared to conventional pure retrieval systems because it involves both retrieval and generation stages. Accordingly, this study proposes SHRAG, a novel framework designed to facilitate the seamless integration of Information Retrieval and RAG while simultaneously securing precise retrieval performance. SHRAG utilizes a Large Language Model as a Query Strategist to automatically transform unstructured natural language queries into logically structured search queries, subsequently performing Boolean retrieval to emulate the search process of an expert human searcher. Furthermore, it incorporates multilingual query expansion and a multilingual embedding model, enabling it to perform efficient cross-lingual question answering within the multilingual dataset environment of the ScienceON Challenge. Experimental results demonstrate that the proposed method, combining logical retrieval capabilities and generative reasoning, can significantly enhance the accuracy and reliability of RAG systems. Furthermore, SHRAG moves beyond conventional document-centric retrieval methods, presenting the potential for a new search paradigm capable of providing direct and reliable responses to queries.
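The Boolean-retrieval step downstream of the Query Strategist can be sketched as follows: multilingual expansions of one concept are OR-ed within a group, and the concept groups are AND-ed together. The function name and query syntax are illustrative assumptions, not SHRAG's actual interface.

```python
# Hedged sketch of turning a structured query plan (concept groups with
# multilingual expansions) into a Boolean search string. The syntax
# ("term" OR ...) AND (...) is an assumption for illustration.

def build_boolean_query(concept_groups):
    """concept_groups: list of lists; each inner list holds the synonyms
    and translations of one concept. Returns a Boolean query string."""
    clauses = []
    for group in concept_groups:
        terms = " OR ".join(f'"{t}"' for t in group)
        clauses.append(f"({terms})")
    return " AND ".join(clauses)
```

For example, a question about graphene batteries might expand into two concept groups, one of which carries an Italian translation, yielding `("graphene" OR "grafene") AND ("battery")`.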
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.48)
CodeVaani: A Multilingual, Voice-Based Code Learning Assistant
Havare, Jayant, Tamilselvam, Srikanth, Mittal, Ashish, Thorat, Shalaka, Jadia, Soham, Apte, Varsha, Ramakrishnan, Ganesh
Programming education often assumes English proficiency and text-based interaction, creating barriers for students from multilingual regions such as India. We present CodeVaani, a multilingual speech-driven assistant for understanding code, built into Bodhitree [1], a Learning Management System developed at IIT Bombay. It is a voice-enabled assistant that helps learners explore programming concepts in their native languages. The system integrates Indic ASR, a code-aware transcription refinement module, and a code model for generating relevant answers. Responses are provided in both text and audio for natural interaction. In a study with 28 beginner programmers, CodeVaani achieved 75% response accuracy, with over 80% of participants rating the experience positively. Compared to classroom assistance, our framework offers on-demand availability, scalability to support many learners, and multilingual support that lowers the entry barrier for students with limited English proficiency. The demo will illustrate these capabilities and highlight how voice-based AI systems can make programming education more inclusive. Supplementary artifacts and demo video are also made available.
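The three-stage pipeline described above (ASR, code-aware refinement, answer generation) composes naturally as a chain of callables. The stage implementations below are stand-in stubs for illustration, not CodeVaani's actual components.

```python
# Toy sketch of the ASR -> code-aware refinement -> answer pipeline.
# All stage implementations here are illustrative stubs.

def run_pipeline(audio, asr, refine, answer):
    """Compose the three stages; each argument after `audio` is a callable."""
    transcript = asr(audio)        # speech -> raw text
    refined = refine(transcript)   # fix code terms the ASR misheard
    return answer(refined)         # query the code model

# Example with stub stages: the refinement step repairs a misheard
# programming term before the question reaches the code model.
result = run_pipeline(
    "audio-bytes",
    asr=lambda a: "what is a for lop",
    refine=lambda t: t.replace("for lop", "for loop"),
    answer=lambda q: f"Answering: {q}",
)
```

The code-aware refinement stage is the interesting design choice: generic ASR routinely garbles programming vocabulary, so correcting it before generation matters more here than in ordinary voice assistants.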
- Research Report (0.50)
- Questionnaire & Opinion Survey (0.49)
Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Babakhin, Yauhen, Osmulski, Radek, Ak, Ronay, Moreira, Gabriel, Xu, Mengyao, Schifferer, Benedikt, Liu, Bo, Oldridge, Even
We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks -- including retrieval, classification and semantic textual similarity (STS) -- and excels in challenging multilingual scenarios, such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. The llama-embed-nemotron-8b is an instruction-aware model, supporting user-defined instructions to enhance performance for specific use-cases. This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.
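The contrastive-loss implementations compared in the ablation study are variants of an in-batch InfoNCE objective, which can be sketched in plain Python. The temperature value and all names here are illustrative assumptions, not the paper's configuration.

```python
import math

# Toy sketch of in-batch InfoNCE contrastive loss for embedding training:
# query i is paired with document i, and the other documents in the batch
# serve as negatives. Temperature 0.05 is an illustrative assumption.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean cross-entropy of picking the paired document per query."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in documents]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[i] - log_denominator)
    return total / len(queries)
```

When each query is well aligned with its paired document and far from the negatives, the loss approaches zero; practical implementations differ mainly in how they gather negatives across devices and handle duplicates, which is the kind of design choice such an ablation compares.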
EmbeddingGemma: Powerful and Lightweight Text Representations
Vera, Henrique Schechter, Dua, Sahil, Zhang, Biao, Salz, Daniel, Mullins, Ryan, Panyam, Sindhu Raghuram, Smoot, Sara, Naim, Iftekhar, Zou, Joe, Chen, Feiyang, Cer, Daniel, Lisak, Alice, Choi, Min, Gonzalez, Lucas, Sanseviero, Omar, Cameron, Glenn, Ballantyne, Ian, Black, Kat, Chen, Kaifeng, Wang, Weiyi, Li, Zhe, Martins, Gus, Lee, Jinhyuk, Sherwood, Mark, Ji, Juyeong, Wu, Renjie, Zheng, Jingxiao, Singh, Jyotinder, Sharma, Abheesht, Sreepathihalli, Divyashree, Jain, Aashi, Elarabawy, Adham, Co, AJ, Doumanoglou, Andreas, Samari, Babak, Hora, Ben, Potetz, Brian, Kim, Dahun, Alfonseca, Enrique, Moiseev, Fedor, Han, Feng, Gomez, Frank Palma, Ábrego, Gustavo Hernández, Zhang, Hesen, Hui, Hui, Han, Jay, Gill, Karan, Chen, Ke, Chen, Koert, Shanbhogue, Madhuri, Boratko, Michael, Suganthan, Paul, Duddu, Sai Meher Karthik, Mariserla, Sandeep, Ariafar, Setareh, Zhang, Shanfeng, Zhang, Shijie, Baumgartner, Simon, Goenka, Sonam, Qiu, Steve, Dabral, Tanmaya, Walker, Trevor, Rao, Vikram, Khawaja, Waleed, Zhou, Wenlei, Ren, Xiaoqi, Xia, Ye, Chen, Yichang, Chen, Yi-Ting, Dong, Zhe, Ding, Zhongli, Visin, Francesco, Liu, Gaël, Zhang, Jiageng, Kenealy, Kathleen, Casbon, Michelle, Kumar, Ravin, Mesnard, Thomas, Gleicher, Zach, Brick, Cormac, Lacombe, Olivier, Roberts, Adam, Yin, Qin, Sung, Yunhsuan, Hoffmann, Raphael, Warkentin, Tris, Joulin, Armand, Duerig, Tom, Seyedhosseini, Mojtaba
We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
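One common way to "merge checkpoints from varied, optimized mixtures" is uniform weight averaging, sketched below. EmbeddingGemma's actual merging recipe may differ; this is an assumption for illustration, with parameters modeled as flat lists rather than tensors.

```python
# Hedged sketch of checkpoint merging by uniform parameter averaging.
# Checkpoints are dicts mapping parameter names to flat lists of floats;
# real implementations average tensors, possibly with per-checkpoint
# weights, but the principle is the same.

def merge_checkpoints(checkpoints):
    """Elementwise-average each named parameter across all checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    merged = {}
    for key in checkpoints[0]:
        params = [ckpt[key] for ckpt in checkpoints]
        merged[key] = [sum(vals) / len(vals) for vals in zip(*params)]
    return merged
```

Averaging checkpoints trained on different data mixtures tends to retain the strengths of each mixture while smoothing out mixture-specific overfitting, which is the generalizability benefit the abstract alludes to.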