M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Anugraha, David, Irawan, Patrick Amadeus, Singh, Anshul, Lee, En-Shiun Annie, Winata, Genta Indra
Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures
Oskooei, Amirkia Rafiei, Yukcu, S. Selcan, Bozoglan, Mehmet Cevheri, Aktas, Mehmet S.
Bug localization in multi-repository microservice architectures is challenging due to the semantic gap between natural language bug reports and code, LLM context limitations, and the need to first identify the correct repository. We propose reframing this as a natural language reasoning task by transforming codebases into hierarchical NL summaries and performing NL-to-NL search instead of cross-modal retrieval. Our approach builds context-aware summaries at file, directory, and repository levels, then uses a two-phase search: first routing bug reports to relevant repositories, then performing top-down localization within those repositories. Evaluated on DNext, an industrial system with 46 repositories and 1.1M lines of code, our method achieves Pass@10 of 0.82 and MRR of 0.50, significantly outperforming retrieval baselines and agentic RAG systems like GitHub Copilot and Cursor. This work demonstrates that engineered natural language representations can be more effective than raw source code for scalable bug localization, providing an interpretable repository -> directory -> file search path that gives enterprise AI tools the transparency needed to build trust.
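The two-phase search described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` structure, the toy summaries, and especially the Jaccard token-overlap scorer, which merely stands in for whatever NL-to-NL retriever the authors actually use.

```python
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens of a text."""
    return set(text.lower().split())


def overlap(query: str, summary: str) -> float:
    """Jaccard similarity of token sets (hypothetical stand-in for a real NL-to-NL scorer)."""
    q, s = tokens(query), tokens(summary)
    return len(q & s) / len(q | s) if (q | s) else 0.0


@dataclass
class Node:
    """A repository, directory, or file together with its natural language summary."""
    name: str
    summary: str
    children: list["Node"] = field(default_factory=list)  # empty for files


def route(bug_report: str, repos: list[Node], top_k: int = 1) -> list[Node]:
    """Phase 1: rank repositories by how well their summary matches the bug report."""
    ranked = sorted(repos, key=lambda r: overlap(bug_report, r.summary), reverse=True)
    return ranked[:top_k]


def localize(bug_report: str, node: Node, path: tuple[str, ...] = ()) -> tuple[str, ...]:
    """Phase 2: descend greedily from repository to directory to file."""
    path = path + (node.name,)
    if not node.children:  # reached a file
        return path
    best = max(node.children, key=lambda c: overlap(bug_report, c.summary))
    return localize(bug_report, best, path)


# Toy data, invented purely for this example.
repos = [
    Node("billing-service", "handles invoices payments and refunds", [
        Node("api", "http endpoints for invoices and refunds", [
            Node("refund_controller.py", "validates and issues refunds for paid invoices"),
            Node("invoice_controller.py", "creates and lists invoices"),
        ]),
        Node("db", "database models and migrations", [
            Node("models.py", "invoice and payment tables"),
        ]),
    ]),
    Node("auth-service", "handles login tokens and user sessions", [
        Node("session.py", "creates and expires user sessions"),
    ]),
]
bug = "refunds fail for paid invoices"

target_repo = route(bug, repos)[0]
search_path = localize(bug, target_repo)
print(" -> ".join(search_path))
```

The returned tuple is the interpretable repository -> directory -> file path; a real system would presumably keep several candidates per level rather than only the single best child.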
WhatsCode: Large-Scale GenAI Deployment for Developer Efficiency at WhatsApp
Mao, Ke, Kapus, Timotej, Åhs, Cons T, Marescotti, Matteo, Ip, Daniel, Hajdu, Ákos, Cela, Sopot, Banerjee, Aparup
The deployment of AI-assisted development tools in compliance-relevant, large-scale industrial environments remains a significant gap in the academic literature, despite growing industry adoption. We report on the industrial deployment of WhatsCode, a domain-specific AI development system that supports WhatsApp (serving over 2 billion users) and processes millions of lines of code across multiple platforms. Over 25 months (2023-2025), WhatsCode evolved from targeted privacy automation to autonomous agentic workflows integrated with end-to-end feature development and DevOps processes. WhatsCode achieved substantial quantifiable impact, improving automated privacy verification coverage 3.5x from 15% to 53%, identifying privacy requirements, and generating over 3,000 accepted code changes with acceptance rates ranging from 9% to 100% across different automation domains. The system committed 692 automated refactor/fix changes, 711 framework adoptions, and 141 feature development assists, and maintained 86% precision in bug triage. Our study identifies two stable human-AI collaboration patterns that emerged from production deployment: one-click rollout for high-confidence changes (60% of cases) and commandeer-revise for complex decisions (40%). We demonstrate that organizational factors, such as ownership models, adoption dynamics, and risk management, are as decisive as technical capabilities for enterprise-scale AI success. The findings provide evidence-based guidance for large-scale AI tool deployment in compliance-relevant environments, showing that effective human-AI collaboration, not full automation, drives sustainable business impact.
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal
Zhang, Haonan, Wang, Dongxia, Liu, Yi, Chen, Kexin, Wang, Jiashui, Ying, Xinlei, Liu, Long, Wang, Wenhai
Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures - a critical functional flaw that undermines their reliability and usability. Current methods for testing this behavior are demonstrably inadequate, suffering from flawed benchmarks and limited test generation capabilities, as highlighted by our empirical user study. To the best of our knowledge, this paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals. ORFuzz uniquely integrates three core components: (1) safety category-aware seed selection for comprehensive test coverage, (2) adaptive mutator optimization using reasoning LLMs to generate effective test cases, and (3) OR-Judge, a human-aligned judge model validated to accurately reflect user perception of toxicity and refusal. Our extensive evaluations demonstrate that ORFuzz generates diverse, validated over-refusal instances at a rate (6.98% average) more than double that of leading baselines, effectively uncovering vulnerabilities. Furthermore, ORFuzz's outputs form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves a superior 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets. ORFuzz and ORFuzzSet provide a robust automated testing framework and a valuable community resource, paving the way for developing more reliable and trustworthy LLM-based software systems.
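To make the headline metric concrete, here is a minimal sketch of how an average over-refusal rate across several models could be computed. It assumes that each model's responses to benign prompts have already been labelled as refused or not (for example by a judge model); the function names and the toy labels are hypothetical and not taken from the paper.

```python
from statistics import mean


def over_refusal_rate(refused_flags: list[bool]) -> float:
    """Fraction of benign prompts that a single model refused."""
    if not refused_flags:
        raise ValueError("need at least one labelled response")
    return sum(refused_flags) / len(refused_flags)


def average_over_refusal_rate(per_model: dict[str, list[bool]]) -> float:
    """Unweighted mean of the per-model over-refusal rates."""
    return mean(over_refusal_rate(flags) for flags in per_model.values())


# Toy labels, invented for illustration: True means the benign prompt was refused.
toy_labels = {
    "model_a": [True, False, True, True],
    "model_b": [False, False, True, False],
}

for name, flags in toy_labels.items():
    print(name, over_refusal_rate(flags))
print("average", average_over_refusal_rate(toy_labels))
```

Averaging per-model rates, rather than pooling all responses, keeps each model's contribution equal even if the models were tested on different numbers of prompts; whether the paper averages this way is an assumption.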
SustainDiffusion: Optimising the Social and Environmental Sustainability of Stable Diffusion Models
d'Aloisio, Giordano, Fadahunsi, Tosin, Choy, Jay, Moussa, Rebecca, Sarro, Federica
Background: Text-to-image generation models are widely used across numerous domains. Among these models, Stable Diffusion (SD) - an open-source text-to-image generation model - has become the most popular, producing over 12 billion images annually. However, the widespread use of these models raises concerns regarding their social and environmental sustainability. Aims: To reduce the harm that SD models may have on society and the environment, we introduce SustainDiffusion, a search-based approach designed to enhance the social and environmental sustainability of SD models. Method: SustainDiffusion searches for the optimal combination of hyperparameters and prompt structures that can reduce gender and ethnic bias in generated images while also lowering the energy consumption required for image generation. Importantly, SustainDiffusion maintains image quality comparable to that of the original SD model. Results: We conduct a comprehensive empirical evaluation of SustainDiffusion, testing it against six different baselines using 56 different prompts. Our results demonstrate that SustainDiffusion can reduce gender bias in SD3 by 68%, ethnic bias by 59%, and energy consumption (calculated as the sum of CPU and GPU energy) by 48%. Additionally, the outcomes produced by SustainDiffusion are consistent across multiple runs and can be generalised to various prompts. Conclusions: With SustainDiffusion, we demonstrate how enhancing the social and environmental sustainability of text-to-image generation models is possible without fine-tuning or changing the model's architecture.
Government promises 50,000 new apprenticeships in youth employment push
The government says some 50,000 young people are expected to benefit from a programme to expand apprenticeships as it looks to tackle youth unemployment. The £725 million package, which was earmarked in the Budget and covers the next three years, will be used to create apprenticeships in sectors including AI, hospitality and engineering. Apprenticeships for people under the age of 25 at small and medium-sized businesses will be fully funded as part of the package, removing the 5% that they currently have to pay. The government is aiming to reverse a decline in the number of young people starting apprenticeships, which has fallen by almost 40% in the past decade. The funding also includes £140m for a pilot that the Department for Work and Pensions says will allow local mayors to connect young people with employers and apprenticeship opportunities, although it is unclear exactly how the money will be used.
Rugby star Sinfield completes gruelling ultramarathon challenge in memory of Rob Burrow
Kevin Sinfield has completed seven ultramarathons in seven days to raise money and awareness for motor neurone disease (MND). The rugby league legend ran about 300km (185 miles) throughout the week, starting at Bury St Edmunds Rugby Club and ending at Leeds Rhinos' home ground, Headingley Stadium. The 45-year-old completed an ultramarathon of at least 45km (27.9 miles) each day of his challenge, in bursts of 7km (4.3 miles). On Sunday he crossed the finish line in front of hundreds of supporters, who had gathered in the stadium's North and West stands to cheer him on. He said: "To the MND Community and the people we've met on route, all through the last week, all through the past five years, to everybody we've met - it's an absolutely beautiful community."
Chernobyl radiation shield 'lost safety function' after drone strike, UN watchdog says
A protective shield covering the Chernobyl nuclear reactor in Ukraine can no longer provide its main containment function following a drone strike earlier this year, according to a UN watchdog. International Atomic Energy Agency (IAEA) inspectors found that the massive structure, built over the site of the 1986 nuclear disaster, had lost its primary safety functions, including the confinement capability. In February, Ukraine accused Russia of targeting the power plant - a claim the Kremlin denied. The IAEA said repairs were essential to prevent further degradation of the nuclear shelter. However, environmental expert Jim Smith told the BBC: "It is not something to panic about."
Drone strike on Sudan preschool by RSF and ally kills dozens of children
A drone attack by the RSF and its allied al-Hilou group on a preschool in Kalogi in Sudan has killed more than 100 people, dozens of whom were children. It sparked international condemnation amid worsening violence as the RSF fights Sudan's Armed Forces in South Kordofan state.