multilingual
- North America > United States > Maryland (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California (0.04)
- (3 more...)
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly with complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development.
Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual Speech Recognition Evaluation
Srivastav, Vaibhav, Zheng, Steven, Bezzam, Eric, Le Bihan, Eustache, Moumen, Adel, Gandhi, Sanchit
Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including a dedicated multilingual track. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
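The two metrics named in the Open ASR Leaderboard abstract can be illustrated with a minimal sketch. It assumes the standard definitions: WER is the word-level edit distance divided by the number of reference words, and RTFx is audio duration divided by processing time, so higher is faster. The function names are hypothetical and are not taken from the leaderboard's code, and text normalization is assumed to have been applied to both strings beforehand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes both strings were already normalized (casing, punctuation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTFx: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds
```

Under these definitions, a system that transcribes 600 seconds of audio in 3 seconds has an RTFx of 200, and a hypothesis with one substitution and one deletion against a six-word reference has a WER of 2/6.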
Multilingual Pretraining for Pixel Language Models
Kesen, Ilker, Lotz, Jonas F., Ziegler, Ingo, Rust, Phillip, Elliott, Desmond
Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (12 more...)
SHRAG: A Framework for Combining Human-Inspired Search with RAG
Ryu, Hyunseok, Shin, Wonjune, Park, Hyun
Retrieval-Augmented Generation (RAG) is gaining recognition as one of the key technological axes for next generation information retrieval, owing to its ability to mitigate the hallucination phenomenon in Large Language Models (LLMs) and effectively incorporate up-to-date information. However, specialized expertise is necessary to construct a high-quality retrieval system independently; moreover, RAG demonstrates relatively slower processing speeds compared to conventional pure retrieval systems because it involves both retrieval and generation stages. Accordingly, this study proposes SHRAG, a novel framework designed to facilitate the seamless integration of Information Retrieval and RAG while simultaneously securing precise retrieval performance. SHRAG utilizes a Large Language Model as a Query Strategist to automatically transform unstructured natural language queries into logically structured search queries, subsequently performing Boolean retrieval to emulate the search process of an expert human searcher. Furthermore, it incorporates multilingual query expansion and a multilingual embedding model, enabling it to perform efficient cross-lingual question answering within the multilingual dataset environment of the ScienceON Challenge. Experimental results demonstrate that the proposed method, combining logical retrieval capabilities and generative reasoning, can significantly enhance the accuracy and reliability of RAG systems. Furthermore, SHRAG moves beyond conventional document-centric retrieval methods, presenting the potential for a new search paradigm capable of providing direct and reliable responses to queries.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.48)
Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Babakhin, Yauhen, Osmulski, Radek, Ak, Ronay, Moreira, Gabriel, Xu, Mengyao, Schifferer, Benedikt, Liu, Bo, Oldridge, Even
We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks -- including retrieval, classification and semantic textual similarity (STS) -- and excels in challenging multilingual scenarios, such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. The llama-embed-nemotron-8b is an instruction-aware model, supporting user-defined instructions to enhance performance for specific use-cases. This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- (3 more...)
S-DAT: A Multilingual, GenAI-Driven Framework for Automated Divergent Thinking Assessment
Haase, Jennifer, Hanel, Paul H. P., Pokutta, Sebastian
This paper introduces S-DAT (Synthetic-Divergent Association Task), a scalable, multilingual framework for automated assessment of divergent thinking (DT), a core component of human creativity. Traditional creativity assessments are often labor-intensive, language-specific, and reliant on subjective human ratings, limiting their scalability and cross-cultural applicability. In contrast, S-DAT leverages large language models and advanced multilingual embeddings to compute semantic distance -- a language-agnostic proxy for DT. We evaluate S-DAT across eleven diverse languages, including English, Spanish, German, Russian, Hindi, and Japanese (Kanji, Hiragana, Katakana), demonstrating robust and consistent scoring across linguistic contexts. Unlike prior DAT approaches, the S-DAT shows convergent validity with other DT measures and correct discriminant validity with convergent thinking. This cross-linguistic flexibility allows for more inclusive, global-scale creativity research, addressing key limitations of earlier approaches. S-DAT provides a powerful tool for fairer, more comprehensive evaluation of cognitive flexibility in diverse populations and can be freely accessed online: https://sdat.iol.zib.de/.
- Europe > Germany > Berlin (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
- Research Report > Experimental Study (0.67)
- Research Report > Promising Solution (0.46)
- Research Report > New Finding (0.46)
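The semantic-distance idea in the S-DAT abstract can be sketched as the mean pairwise cosine distance between the embeddings of a participant's words. This is an illustration under stated assumptions, not the authors' scoring code: the abstract does not specify the embedding model, any scaling, or preprocessing, so the vectors below are toy values and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 minus cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors: list[list[float]]) -> float:
    """Average cosine distance over all unordered pairs of word embeddings."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings" (assumed values, for illustration only).
toy_words = {
    "cat": [1.0, 0.0],
    "kitten": [1.0, 0.0],
    "volcano": [0.0, 1.0],
}
score = mean_pairwise_distance(list(toy_words.values()))
```

A word list whose embeddings point in more varied directions yields a higher average distance, which is the property that makes the measure usable as a proxy for divergent thinking.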
EmbeddingGemma: Powerful and Lightweight Text Representations
Vera, Henrique Schechter, Dua, Sahil, Zhang, Biao, Salz, Daniel, Mullins, Ryan, Panyam, Sindhu Raghuram, Smoot, Sara, Naim, Iftekhar, Zou, Joe, Chen, Feiyang, Cer, Daniel, Lisak, Alice, Choi, Min, Gonzalez, Lucas, Sanseviero, Omar, Cameron, Glenn, Ballantyne, Ian, Black, Kat, Chen, Kaifeng, Wang, Weiyi, Li, Zhe, Martins, Gus, Lee, Jinhyuk, Sherwood, Mark, Ji, Juyeong, Wu, Renjie, Zheng, Jingxiao, Singh, Jyotinder, Sharma, Abheesht, Sreepathihalli, Divyashree, Jain, Aashi, Elarabawy, Adham, Co, AJ, Doumanoglou, Andreas, Samari, Babak, Hora, Ben, Potetz, Brian, Kim, Dahun, Alfonseca, Enrique, Moiseev, Fedor, Han, Feng, Gomez, Frank Palma, Ábrego, Gustavo Hernández, Zhang, Hesen, Hui, Hui, Han, Jay, Gill, Karan, Chen, Ke, Chen, Koert, Shanbhogue, Madhuri, Boratko, Michael, Suganthan, Paul, Duddu, Sai Meher Karthik, Mariserla, Sandeep, Ariafar, Setareh, Zhang, Shanfeng, Zhang, Shijie, Baumgartner, Simon, Goenka, Sonam, Qiu, Steve, Dabral, Tanmaya, Walker, Trevor, Rao, Vikram, Khawaja, Waleed, Zhou, Wenlei, Ren, Xiaoqi, Xia, Ye, Chen, Yichang, Chen, Yi-Ting, Dong, Zhe, Ding, Zhongli, Visin, Francesco, Liu, Gaël, Zhang, Jiageng, Kenealy, Kathleen, Casbon, Michelle, Kumar, Ravin, Mesnard, Thomas, Gleicher, Zach, Brick, Cormac, Lacombe, Olivier, Roberts, Adam, Yin, Qin, Sung, Yunhsuan, Hoffmann, Raphael, Warkentin, Tris, Joulin, Armand, Duerig, Tom, Seyedhosseini, Mojtaba
We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
- Europe > Middle East > Cyprus > Nicosia > Nicosia (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > UAE (0.04)
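The EmbeddingGemma abstract mentions truncating embedding outputs without describing the procedure. One common approach, assumed here, keeps the leading dimensions and re-normalizes to unit length so that cosine or dot-product comparisons remain meaningful. The function name and the example vector are hypothetical.

```python
import math


def truncate_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale to unit L2 norm.

    Assumption: the leading dimensions carry the most information,
    as in models trained to support truncated outputs.
    """
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Example: a 3-D vector cut to 2-D, then re-normalized.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Storing the shorter vector reduces memory and comparison cost roughly in proportion to the number of dimensions dropped, which is the trade-off the abstract refers to.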
Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs
Saeed, Muhammed, Abdul-Mageed, Muhammad, Shehata, Shady
Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains: women's rights, socioeconomic development, terrorism, and religion, across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (>=95%), Africans to socioeconomic "backwardness" (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our DebateBias-8K benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (6 more...)
- Law (0.89)
- Law Enforcement & Public Safety > Terrorism (0.69)